Abstract

This paper derives an identity connecting the square loss of ridge regression
in on-line mode with the loss of the retrospectively best regressor. Some
corollaries about the properties of the cumulative loss of on-line ridge
regression are also obtained.



1 Introduction 

Ridge regression is a powerful technique of machine learning. It was introduced
in [Hoe62]; the kernel version of it is derived in [SGV98].

Ridge regression can be used as a batch or on-line algorithm. This paper
proves an identity connecting the square losses of ridge regression used on the
same data in batch and on-line fashions. The identity and the approach to the
proof are not entirely new. The identity implicitly appears in [AW01] for the
linear case (it can be obtained by summing (4.21) from [AW01] in an exact rather
than estimated form). One of the proof methods, based essentially on Bayesian
estimation, features in [KSF05], which focuses on probabilistic statements and
stops one step short of formulating the identity. In this paper we put it all
together, explicitly formulate the identity in terms of ridge regression, and give
two proofs for the kernel case. The first proof is by calculating the likelihood
in a Gaussian processes model by different methods. It is remarkable that a
probabilistic proof yields a result that holds in the worst case along any sequence
of signals and outcomes with no probabilistic assumptions. The other proof is
based on the analysis of a Bayesian-type algorithm for prediction with expert
advice; it is reproduced from the unpublished technical report [ZV09].



*An earlier version of this paper appeared in Proceedings of ALT 2010, LNCS 6331,
Springer, 2010. This paper also reproduces some results from technical report [ZV09].

We use the identity to derive several inequalities providing upper bounds
for the cumulative loss of ridge regression applied in the on-line fashion.
Corollaries 2 and 3 deal with 'clipped' ridge regression. The latter reproduces
Theorem 4.6 from [AW01] (this result is often confused with Theorem 4 in
[Vov01], which, in fact, provides a similar bound for an essentially different
algorithm). Corollary 4 shows that for continuous kernels on compact domains
the loss of (unclipped) on-line ridge regression is asymptotically close to the
loss of the retrospectively best regressor. This result cannot be generalised to
non-compact domains.

In the literature there is a range of specially designed regression-type
algorithms with better worst-case bounds or bounds covering wider cases.
Aggregating algorithm regression (also known as the Vovk-Azoury-Warmuth
predictor) is described in [Vov01], [AW01], and Section 11.8 of [CBL06].
Theorem 1 in [Vov01] provides an upper bound for aggregating algorithm
regression; the bound is better than the bound given by Corollary 3 for clipped
ridge regression. The bound from [Vov01] has also been shown to be optimal in a
strong sense. The exact relation between the performance of ridge regression and
the performance of aggregating algorithm regression is not known. Theorem 3 in
[Vov01] describes a case where aggregating algorithm regression performs better,
but in the case of unbounded signals. An important class of regression-type
algorithms achieving different bounds is based on the gradient descent idea; see
[CBLW96], [KW97], and Section 11 in [CBL06]. In [HW01] and [BK07]
regression-type algorithms dealing with changing dependencies are constructed.
In [CZ10] regression is considered within the framework of discounted loss,
which decays with time.

The paper is organised as follows. Section 2 introduces kernels and kernel
ridge regression in batch and on-line settings. We use an explicit formula to
introduce ridge regression; Appendix A contains a proof that this formula
specifies the function with a certain optimality property. Section 3 contains the
statement of the identity and Subsection 3.1 shows that the identity remains
true (in a way) for the case of zero ridge.

Section 4 discusses corollaries of the identity. Section 5 contains the proof
based on a probabilistic interpretation of ridge regression in the context of
Gaussian fields. Section 6 contains an alternative proof based on prediction with
expert advice. The proof has been reproduced from [ZV09].

Appendices A, B, and C contain proofs of some known results; they
have been included for completeness and to clarify the intuition behind other
proofs in the paper.

2 Kernel Ridge Regression in On-line and Batch 
Settings 

2.1 Kernels 

A kernel on a domain $X$, which is an arbitrary set with no structure assumed, is a
symmetric positive-semidefinite function of two arguments, i.e., a function
$\mathcal{K} : X \times X \to \mathbb{R}$ such that

1. for all $x_1, x_2 \in X$ we have $\mathcal{K}(x_1, x_2) = \mathcal{K}(x_2, x_1)$ and

2. for any positive integer $T$, any $x_1, x_2, \ldots, x_T \in X$ and any real numbers
$a_1, a_2, \ldots, a_T \in \mathbb{R}$ we have
$\sum_{i,j=1}^{T} \mathcal{K}(x_i, x_j) a_i a_j \ge 0$.

An equivalent definition can be given as follows. A function
$\mathcal{K} : X \times X \to \mathbb{R}$ is a kernel if there is a Hilbert space
$\mathcal{F}$ of functions on $X$ such that

1. for every $x \in X$ the function $\mathcal{K}(x, \cdot)$, i.e., $\mathcal{K}$
considered as a function of the second argument with the first argument fixed,
belongs to $\mathcal{F}$ and

2. for every $x \in X$ and every $f \in \mathcal{F}$ the value of $f$ at $x$ equals
the scalar product of $f$ by $\mathcal{K}(x, \cdot)$, i.e.,
$f(x) = \langle f, \mathcal{K}(x, \cdot) \rangle_{\mathcal{F}}$; this property is
often called the reproducing property.

The second definition is sometimes said to specify a reproducing kernel. The
space $\mathcal{F}$ is called the reproducing kernel Hilbert space (RKHS) for the
kernel $\mathcal{K}$ (it can be shown that the RKHS for a kernel $\mathcal{K}$ is
unique). The equivalence of the two definitions is proven in [Aro43].

2.2 Regression in Batch and On-line Settings

Suppose that we are given a sample of pairs

\[ S = ((x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)), \]

where all $x_t \in X$ are called signals and $y_t \in \mathbb{R}$ are called outcomes
(or labels) for the corresponding signals. Every pair $(x_t, y_t)$ is called a
(labelled) example.

The task of regression is to fit a function (usually from a particular class)
to the data. The method of kernel ridge regression with a kernel $\mathcal{K}$ and a
real regularisation parameter (ridge) $a > 0$ suggests the function
$f_{\mathrm{RR}}(x) = Y'(K + aI)^{-1} k(x)$,¹ where $Y = (y_1, y_2, \ldots, y_T)'$ is
the column vector of outcomes,

\[ K = \begin{pmatrix}
\mathcal{K}(x_1, x_1) & \mathcal{K}(x_1, x_2) & \cdots & \mathcal{K}(x_1, x_T) \\
\mathcal{K}(x_2, x_1) & \mathcal{K}(x_2, x_2) & \cdots & \mathcal{K}(x_2, x_T) \\
\vdots & \vdots & \ddots & \vdots \\
\mathcal{K}(x_T, x_1) & \mathcal{K}(x_T, x_2) & \cdots & \mathcal{K}(x_T, x_T)
\end{pmatrix} \]

is the kernel matrix and

\[ k(x) = \begin{pmatrix}
\mathcal{K}(x_1, x) \\ \mathcal{K}(x_2, x) \\ \vdots \\ \mathcal{K}(x_T, x)
\end{pmatrix}. \]

¹Throughout this paper $M'$ denotes the transpose of a matrix $M$.
Note that the matrix $K$ is positive-semidefinite by the definition of a kernel,
therefore the matrix $K + aI$ is positive-definite and thus non-singular.

If the sample $S$ is empty, i.e., $T = 0$ or no examples have been given to us
yet, we assume that $f_{\mathrm{RR}}(x) = 0$ for all $x$.

It is easy to see that $f_{\mathrm{RR}}(x)$ is a linear combination of functions
$\mathcal{K}(x_t, x)$ (note that $x$ does not appear outside of $k(x)$ in the ridge
regression formula) and therefore it belongs to the RKHS $\mathcal{F}$ specified by
the kernel $\mathcal{K}$. It can be shown that on $f_{\mathrm{RR}}$ the minimum of
the expression $\sum_{t=1}^{T} (f(x_t) - y_t)^2 + a \|f\|_{\mathcal{F}}^2$
(where $\|\cdot\|_{\mathcal{F}}$ is the norm in $\mathcal{F}$) over all $f$ from the
RKHS $\mathcal{F}$ is achieved. For completeness, we include a proof in Appendix A.
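The batch formula can be illustrated with a short numerical sketch. It is not taken from the paper: the Gaussian kernel, the function names `gaussian_kernel` and `ridge_predict`, and the sample values are assumptions chosen only for illustration.

```python
import numpy as np

def gaussian_kernel(u, v, b=1.0):
    """Illustrative kernel exp(-b * ||u - v||^2); the choice of kernel is an assumption."""
    u, v = np.atleast_1d(u), np.atleast_1d(v)
    return float(np.exp(-b * np.sum((u - v) ** 2)))

def ridge_predict(kernel, xs, ys, x, a):
    """Evaluate Y'(K + aI)^{-1} k(x); the value is 0 for the empty sample."""
    T = len(xs)
    if T == 0:
        return 0.0
    K = np.array([[kernel(xi, xj) for xj in xs] for xi in xs])  # kernel matrix
    k = np.array([kernel(xi, x) for xi in xs])                  # vector k(x)
    # K + aI is positive-definite for a > 0, so the linear system has a unique solution.
    return float(np.asarray(ys, dtype=float) @ np.linalg.solve(K + a * np.eye(T), k))

# Hypothetical sample of three examples on the real line.
xs = [0.0, 0.5, 1.0]
ys = [0.0, 1.0, 0.0]
print(ridge_predict(gaussian_kernel, xs, ys, 0.5, a=0.1))
```

Solving the linear system instead of forming the inverse explicitly is a common numerical choice; it evaluates the same expression.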

Suppose now that the sample is given to us example by example. For each 
example we are shown the signal and then asked to produce a prediction for 
the outcome. One can say that the learner operates according to the following 
protocol: 

Protocol 1.

for t = 1, 2, ...
    read signal $x_t$
    output prediction $\gamma_t$
    read true outcome $y_t$
end for

This learning scenario is called on-line or sequential. The scenario when the 
whole sample is given to us at once as before is called batch to distinguish it 
from on-line. 

One can apply ridge regression in the on-line scenario in the following
natural way. On step $t$ we form the sample $S_{t-1}$ from the $t - 1$ known examples
$(x_1, y_1), (x_2, y_2), \ldots, (x_{t-1}, y_{t-1})$ and output the prediction
suggested by the ridge regression function for this sample.

For the on-line scenario on step $t$ we will use the same notation as in the
batch mode but with the index $t - 1$ denoting the time.² Thus $K_{t-1}$ is the
kernel matrix on step $t$ (its size is $(t - 1) \times (t - 1)$), $Y_{t-1}$ is the
vector of outcomes $y_1, y_2, \ldots, y_{t-1}$, and
$k_{t-1}(x_t) = (\mathcal{K}(x_1, x_t), \mathcal{K}(x_2, x_t), \ldots, \mathcal{K}(x_{t-1}, x_t))'$
is $k(x_t)$ for step $t$. We will be referring to the prediction output by on-line
ridge regression on step $t$ as $\gamma_t^{\mathrm{RR}}$.
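The on-line procedure just described can be sketched as follows. This is an illustrative implementation under assumptions, not code from the paper: the names `linear_kernel` and `online_ridge_predictions` and the toy data are hypothetical, and the kernel matrix is recomputed on every step for clarity rather than efficiency.

```python
import numpy as np

def linear_kernel(u, v):
    """Linear kernel u'v, used here only as an example."""
    return float(np.dot(u, v))

def online_ridge_predictions(kernel, xs, ys, a):
    """Return the on-line ridge regression predictions for the whole sequence."""
    preds = []
    for t in range(len(xs)):          # zero-based t: examples 0..t-1 are already known
        if t == 0:
            preds.append(0.0)         # empty sample: the ridge regression function is 0
            continue
        K = np.array([[kernel(u, v) for v in xs[:t]] for u in xs[:t]])
        k = np.array([kernel(u, xs[t]) for u in xs[:t]])
        w = np.linalg.solve(K + a * np.eye(t), k)
        preds.append(float(np.dot(ys[:t], w)))   # Y'(K + aI)^{-1} k(x_t)
    return preds

# Hypothetical data: the same one-dimensional signal shown three times.
print(online_ridge_predictions(linear_kernel, [[1.0], [1.0], [1.0]], [1.0, 0.0, 1.0], a=1.0))
```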

3 The Identity 

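Theorem 1. Take a kernel $\mathcal{K}$ on a domain $X$ and a parameter $a > 0$. Let
$\mathcal{F}$ be the RKHS for the kernel $\mathcal{K}$. For a sample
$(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)$ let
$\gamma_1^{\mathrm{RR}}, \gamma_2^{\mathrm{RR}}, \ldots, \gamma_T^{\mathrm{RR}}$ be
the predictions output by ridge regression with the kernel $\mathcal{K}$ and the
parameter $a$ in the on-line mode. Then

\[ \sum_{t=1}^{T} \frac{(\gamma_t^{\mathrm{RR}} - y_t)^2}{1 + d_t/a}
 = \min_{f \in \mathcal{F}} \left( \sum_{t=1}^{T} (f(x_t) - y_t)^2 + a \|f\|_{\mathcal{F}}^2 \right)
 = a Y_T' (K_T + aI)^{-1} Y_T , \]

where $d_t = \mathcal{K}(x_t, x_t) - k_{t-1}'(x_t) (K_{t-1} + aI)^{-1} k_{t-1}(x_t) \ge 0$
and all other notation is as above.

²The conference version of this paper used $t$ rather than $t - 1$. This paper uses
$t - 1$ because it coincides with the size and for compatibility with earlier papers.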

The left-hand side term in this equality is close to the cumulative square loss
of ridge regression in the on-line mode. The difference is in the denominators
$1 + d_t/a$. The values $d_t$ have the meaning of variances of ridge regression
predictions according to the probabilistic view discussed below. Lemma 1 shows
that $d_t \to 0$ as $t \to \infty$ for continuous kernels on compact domains. The
terms of the identity thus become close to the cumulative square loss
asymptotically; this intuition is formalised by Corollary 4.

Note that the minimum in the middle term is attained on $f$ specified by
batch ridge regression knowing the whole sample. It is thus nearly the square
loss of the retrospectively best fit $f \in \mathcal{F}$.

The right-hand side term is a simple closed-form expression. 
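A numerical sanity check of the three terms may help to fix the notation. The sketch below is an illustration under assumptions (a Gaussian kernel on the real line, random data, and the hypothetical names `gaussian_kernel` and `identity_terms`); the middle term is evaluated on the batch ridge regression function, on which the minimum is attained.

```python
import numpy as np

def gaussian_kernel(u, v, b=1.0):
    """Illustrative kernel exp(-b * (u - v)^2) on the real line (an assumption)."""
    return float(np.exp(-b * (u - v) ** 2))

def identity_terms(kernel, xs, ys, a):
    """Return the left-hand, middle and right-hand terms of the identity."""
    T = len(xs)
    Y = np.asarray(ys, dtype=float)
    K = np.array([[kernel(u, v) for v in xs] for u in xs])
    lhs = 0.0
    for t in range(T):                       # zero-based t: t examples are already known
        k = K[:t, t]
        w = np.linalg.solve(K[:t, :t] + a * np.eye(t), k) if t > 0 else np.zeros(0)
        gamma = float(Y[:t] @ w)             # on-line prediction (0 for the empty sample)
        d = K[t, t] - float(k @ w)           # the quantity d_t
        lhs += (gamma - Y[t]) ** 2 / (1.0 + d / a)
    c = np.linalg.solve(K + a * np.eye(T), Y)    # coefficients of batch ridge regression
    # For f = sum_i c_i K(x_i, .) the values on the sample are K c and ||f||^2 = c'Kc.
    middle = float(np.sum((K @ c - Y) ** 2) + a * (c @ K @ c))
    rhs = float(a * (Y @ c))
    return lhs, middle, rhs

rng = np.random.default_rng(0)               # hypothetical test data
xs = rng.uniform(-1.0, 1.0, size=12)
ys = rng.standard_normal(12)
print(identity_terms(gaussian_kernel, xs, ys, a=0.3))
```

The three printed values should agree up to rounding error.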



3.1 The Case of Zero Ridge 

In this subsection we show that the identity essentially remains true for a = 0. 

Let the parameter $a$ in the identity approach 0. One may think that the
third term of the identity should tend to zero. On the other hand, the value of
the middle term of the identity for $a = 0$ depends on $Y_T$; the values of $y_t$
can be chosen (at least in some cases) so that there is no exact fit in the RKHS
(i.e., no $f \in \mathcal{F}$ such that $f(x_t) = y_t$, $t = 1, 2, \ldots, T$) and the
middle term is not equal to 0. This section resolves the apparent contradiction.

As a matter of fact, the limit of the identity as $a \to 0$ does not have to be 0.
The situation when there is no exact fit in the RKHS is only possible when the
matrix $K_T$ is singular, and in this situation the right-hand side does not always
tend to 0.

The expression on the left-hand side of the identity is formally undefined for
$a = 0$. The expression on the right-hand side is undefined when $a = 0$ and $K_T$
is singular. The expression in the centre, by contrast, always makes sense. The
following corollary clarifies the situation.

Corollary 1. Under the conditions of Theorem 1, as $a \to 0$, the terms of the
identity

\[ \sum_{t=1}^{T} \frac{(\gamma_t^{\mathrm{RR}} - y_t)^2}{1 + d_t/a}
 = \min_{f \in \mathcal{F}} \left( \sum_{t=1}^{T} (f(x_t) - y_t)^2 + a \|f\|_{\mathcal{F}}^2 \right)
 = a Y_T' (K_T + aI)^{-1} Y_T \]

tend to the squared norm of the projection of the vector $Y_T$ to the null space of
the matrix $K_T$. This coincides with the value of the middle term of the identity
for $a = 0$.

The null space (also called the kernel) of a matrix $S$ is the space of vectors
$x$ such that $Sx = 0$. It is easy to see that the dimension of the null space and
the rank of $S$ (equal to the dimension of the span of the columns of $S$) sum up
to the number of columns of $S$. If, moreover, $S$ is square and symmetric, the
null space of $S$ is the orthogonal complement of the span of the columns of $S$.

Proof. For every $a > 0$ let
$m_a = \inf_{f \in \mathcal{F}} \left( \sum_{t=1}^{T} (f(x_t) - y_t)^2 + a \|f\|_{\mathcal{F}}^2 \right)$.
Proposition 2 implies that if $a > 0$ then the infimum is achieved on the ridge
regression function with the parameter $a$. Throughout this proof we will denote
this function by $f_a$.

Let us calculate the value of
$m_0 = \inf_{f \in \mathcal{F}} \sum_{t=1}^{T} (f(x_t) - y_t)^2$. It follows from
the representer theorem (see Proposition 3) that it is sufficient to consider the
functions $f$ of the form $f(\cdot) = \sum_{i=1}^{T} c_i \mathcal{K}(x_i, \cdot)$.

For $f(\cdot) = \sum_{i=1}^{T} c_i \mathcal{K}(x_i, \cdot)$ the sum
$\sum_{t=1}^{T} (f(x_t) - y_t)^2$ equals the squared norm $\|K_T C - Y_T\|^2$, where
$C = (c_1, c_2, \ldots, c_T)'$ is the vector of coefficients of the linear
combination. If $C_0$ minimises this expression, then $K_T C_0$ is the projection
of $Y_T$ to the linear span of the columns of $K_T$. The vector $Y_T - K_T C_0$ is
then the projection of $Y_T$ to the orthogonal complement of the span and the
orthogonal complement is the null space of $K_T$.

Let us show that $m_a$ is continuous at $a = 0$. Fix some $f_0$ such that the
infimum $m_0$ is achieved on $f_0$ (if $K_T$ is singular there can be more than one
such function). Substituting $f_0$ into the formula for $m_a$ yields
$m_a \le m_0 + a \|f_0\|_{\mathcal{F}}^2 = m_0 + o(1)$ as $a \to 0$. Substituting
$f_a$ into the definition of $m_0$ yields $m_0 \le m_a$. We thus get that
$m_a \to m_0$ as $a \to 0$. □
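The limit in Corollary 1 can be observed numerically. The following sketch is an illustration under assumptions (a linear kernel, a small hand-picked sample with a repeated signal so that the kernel matrix is singular, and the hypothetical helper name `rhs`); it compares the closed-form right-hand side for small $a$ with the squared norm of the projection of the outcome vector to the null space.

```python
import numpy as np

# Rows are signals; the first two coincide, so the linear kernel matrix is singular.
X = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = X @ X.T
Y = np.array([1.0, 0.0, 3.0])

def rhs(a):
    """Closed-form term a Y'(K + aI)^{-1} Y."""
    return float(a * (Y @ np.linalg.solve(K + a * np.eye(len(Y)), Y)))

# Orthonormal basis of the null space of K: eigenvectors with (numerically) zero eigenvalue.
eigvals, eigvecs = np.linalg.eigh(K)
null_basis = eigvecs[:, eigvals < 1e-10]
limit = float(np.sum((null_basis.T @ Y) ** 2))   # squared norm of the projection of Y

for a in (1.0, 1e-2, 1e-4, 1e-6):
    print(a, rhs(a))
print("projection:", limit)
```

In this example no function in the RKHS fits both outcomes attached to the repeated signal, which is why the limit is not zero.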



4 Corollaries 

In this section we use the identity to obtain some properties of cumulative losses 
of on-line algorithms. 



4.1 A Multiplicative Bound 

It is easy to obtain a basic multiplicative bound on the loss of on-line ridge
regression. The matrix $(K_{t-1} + aI)^{-1}$ is positive-definite as the inverse of
a positive-definite matrix, therefore
$k_{t-1}'(x_t) (K_{t-1} + aI)^{-1} k_{t-1}(x_t) \ge 0$ and
$d_t \le \mathcal{K}(x_t, x_t)$. Assuming that there is $c_{\mathcal{F}} > 0$ such
that $\mathcal{K}(x, x) \le c_{\mathcal{F}}^2$ on $X$ (i.e., the evaluation
functional on $\mathcal{F}$ is uniformly bounded by $c_{\mathcal{F}}$), we get

\[ \sum_{t=1}^{T} (\gamma_t^{\mathrm{RR}} - y_t)^2
 \le \left( 1 + \frac{c_{\mathcal{F}}^2}{a} \right)
 \min_{f \in \mathcal{F}} \left( \sum_{t=1}^{T} (f(x_t) - y_t)^2 + a \|f\|_{\mathcal{F}}^2 \right)
 = (a + c_{\mathcal{F}}^2) Y_T' (K_T + aI)^{-1} Y_T . \tag{1} \]

4.2 Additive Bounds for Clipped Regression

Some less trivial bounds can be obtained under the following assumption.
Suppose that we know in advance that outcomes $y$ come from an interval $[-Y, Y]$,
and $Y$ is known to us. It does not make sense then to make predictions outside
of the interval. One may consider clipped ridge regression, which operates as
follows. For a given signal the ridge regression prediction $\gamma^{\mathrm{RR}}$
is calculated; if it falls inside the interval, it is kept; if it is outside of the
interval, it is replaced by the closest point from the interval. Denote the
prediction of clipped ridge regression by $\gamma^{\mathrm{RR},Y}$. If
$y \in [-Y, Y]$ indeed holds, then
$(\gamma^{\mathrm{RR},Y} - y)^2 \le (\gamma^{\mathrm{RR}} - y)^2$ and
$(\gamma^{\mathrm{RR},Y} - y)^2 \le 4Y^2$.

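Corollary 2. Take a kernel $\mathcal{K}$ on a domain $X$ and a parameter $a > 0$.
Let $\mathcal{F}$ be the RKHS for the kernel $\mathcal{K}$. For a sample
$(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)$ such that $y_t \in [-Y, Y]$ for all
$t = 1, 2, \ldots, T$, let
$\gamma_1^{\mathrm{RR},Y}, \gamma_2^{\mathrm{RR},Y}, \ldots, \gamma_T^{\mathrm{RR},Y}$
be the predictions output by clipped ridge regression with the kernel $\mathcal{K}$
and the parameter $a$ in the on-line mode. Then

\[ \sum_{t=1}^{T} (\gamma_t^{\mathrm{RR},Y} - y_t)^2 \le
 \min_{f \in \mathcal{F}} \left( \sum_{t=1}^{T} (f(x_t) - y_t)^2 + a \|f\|_{\mathcal{F}}^2 \right)
 + 4Y^2 \ln \det \left( I + \frac{1}{a} K_T \right) , \]

where $K_T$ is as above.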
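Proof. We have

\[ \frac{1}{1 + d_t/a} = 1 - \frac{d_t/a}{1 + d_t/a} \]

and

\[ \frac{d_t/a}{1 + d_t/a} \le \ln(1 + d_t/a) ; \]

indeed, for $b \ge 0$ the inequality $b/(1 + b) \le \ln(1 + b)$ holds and can be
checked by differentiation. Therefore

\[ \sum_{t=1}^{T} (\gamma_t^{\mathrm{RR},Y} - y_t)^2
 = \sum_{t=1}^{T} (\gamma_t^{\mathrm{RR},Y} - y_t)^2 \frac{1}{1 + d_t/a}
 + \sum_{t=1}^{T} (\gamma_t^{\mathrm{RR},Y} - y_t)^2 \frac{d_t/a}{1 + d_t/a} \]
\[ \le \sum_{t=1}^{T} (\gamma_t^{\mathrm{RR}} - y_t)^2 \frac{1}{1 + d_t/a}
 + 4Y^2 \sum_{t=1}^{T} \ln(1 + d_t/a) . \]

By Theorem 1 the first sum in the last expression equals the minimum from the
statement of the corollary. Lemma 3 proved below yields

\[ \prod_{t=1}^{T} (1 + d_t/a) = \frac{1}{a^T} \det(K_T + aI)
 = \det \left( I + \frac{1}{a} K_T \right) . \]

□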
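Both the determinant formula used at the end of the proof and the bound of Corollary 2 can be checked on synthetic data. The sketch below is only an illustration under assumptions: the Gaussian kernel, the data-generating rule, and names such as `Ybound`, `loss_clipped` and `log_prod` are hypothetical and not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, a, Ybound = 30, 0.5, 1.0
xs = rng.uniform(-1.0, 1.0, size=T)
# Hypothetical outcomes, forced into [-Ybound, Ybound] as Corollary 2 requires.
ys = np.clip(np.sin(3 * xs) + 0.3 * rng.standard_normal(T), -Ybound, Ybound)
K = np.exp(-(xs[:, None] - xs[None, :]) ** 2)    # Gaussian kernel matrix (an assumption)

loss_clipped, log_prod = 0.0, 0.0
for t in range(T):                               # zero-based t: t examples already seen
    k = K[:t, t]
    w = np.linalg.solve(K[:t, :t] + a * np.eye(t), k) if t > 0 else np.zeros(0)
    gamma = float(np.clip(ys[:t] @ w, -Ybound, Ybound))   # clipped on-line prediction
    d = K[t, t] - float(k @ w)                   # the quantity d_t
    loss_clipped += (gamma - ys[t]) ** 2
    log_prod += np.log1p(d / a)                  # accumulates ln prod_t (1 + d_t/a)

sign, logdet = np.linalg.slogdet(np.eye(T) + K / a)
best = float(a * (ys @ np.linalg.solve(K + a * np.eye(T), ys)))   # middle term of the identity
print(log_prod, logdet)
print(loss_clipped, best + 4 * Ybound ** 2 * logdet)
```

The first printed pair should coincide up to rounding error, and the first number of the second pair should not exceed the second.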






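There is no sub-linear upper bound on the regret term

\[ 4Y^2 \ln \det \left( I + \frac{1}{a} K_T \right) \]

in the general case; indeed, consider the kernel

\[ \delta(x_1, x_2) = \begin{cases} 1 & \text{if } x_1 = x_2; \\ 0 & \text{otherwise} \end{cases} \tag{2} \]

(for pairwise different signals $K_T = I$ and the regret term equals
$4Y^2 T \ln(1 + 1/a)$, which is linear in $T$). However we can get good bounds in
special cases.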

It is shown in [SKF08] that for the Gaussian kernel
$\mathcal{K}(x_1, x_2) = e^{-b \|x_1 - x_2\|^2}$, where $x_1, x_2 \in \mathbb{R}^d$,
we can get an upper bound on average. Suppose that all $x$s are independently
identically distributed according to the Gaussian distribution with the mean of 0
and variance of $cI$. Then for the expectation we have
$\mathbb{E} \ln \det \left( I + \frac{1}{a} K_T \right) = O((\ln T)^{d+1})$ (see
Section IV.B in [SKF08]). This yields a bound on the expected loss of clipped ridge
regression.

Consider the linear kernel $\mathcal{K}(x_1, x_2) = x_1' x_2$ defined on column
vectors from $\mathbb{R}^n$. We have $\mathcal{K}(x, x) = \|x\|^2$, where
$\|\cdot\|$ is the quadratic norm in $\mathbb{R}^n$. The reproducing kernel Hilbert
space is the set of all linear functions on $\mathbb{R}^n$. We have
$K_T = X_T' X_T$, where $X_T$ is the design matrix made up of column vectors
$x_1, x_2, \ldots, x_T$. The Sylvester determinant identity
$\det(I + UV) = \det(I + VU)$ (see, e.g., [HS81], Eq. (6)) implies that

\[ \det \left( I + \frac{1}{a} X_T' X_T \right)
 = \det \left( I + \frac{1}{a} X_T X_T' \right)
 = \det \left( I + \frac{1}{a} \sum_{t=1}^{T} x_t x_t' \right) . \]

We will use an upper bound from [CBCG05] for this determinant; a proof is
given in Appendix C for completeness. We get the following corollary.

Corollary 3. For a sample $(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)$, where
$\|x_t\| \le B$ and $y_t \in [-Y, Y]$ for all $t = 1, 2, \ldots, T$, let
$\gamma_1^{\mathrm{RR},Y}, \gamma_2^{\mathrm{RR},Y}, \ldots, \gamma_T^{\mathrm{RR},Y}$
be the predictions output by clipped linear ridge regression with a parameter
$a > 0$ in the on-line mode. Then

\[ \sum_{t=1}^{T} (\gamma_t^{\mathrm{RR},Y} - y_t)^2 \le
 \min_{\theta \in \mathbb{R}^n} \left( \sum_{t=1}^{T} (\theta' x_t - y_t)^2 + a \|\theta\|^2 \right)
 + 4Y^2 n \ln \left( 1 + \frac{T B^2}{a n} \right) . \]

It is an interesting problem if the bound is optimal. As far as we know, there
is a gap in existing bounds. Theorem 2 in [Vov01] shows that $Y^2 n \ln T$ is a
lower bound for any learner and in the constructed example,
$\|x_t\|_\infty = 1$. Theorem 3 in [Vov01] provides a stronger lower bound, but at
the cost of allowing unbounded $x$s.

4.3 An Asymptotic Comparison

The inequalities we have considered so far hold for finite time horizons $T$. We
shall now let $T$ tend to infinity.

Let us analyse the behaviour of the quantity

\[ d_t = \mathcal{K}(x_t, x_t) - k_{t-1}'(x_t) (K_{t-1} + aI)^{-1} k_{t-1}(x_t) . \]

According to the probabilistic interpretation discussed in Subsection 5.1, $d_t$
has the meaning of the variance of the prediction output by kernel ridge regression
on step $t$ and therefore it is always non-negative.

The probabilistic interpretation suggests that the variance should go down
with time as we learn the data better. In the general case this is not true.
Indeed, if $\mathcal{K} = \delta$ defined by (2) and all $x_t$ are different then
$d_t = 1$ for all $t = 1, 2, \ldots$ However under natural assumptions that hold in
most reasonable cases the following lemma holds. The lemma generalises Lemma A.1
from [KTT09] because for the linear kernel $d_t$ can be represented as shown in
(14) below.

Lemma 1. Let $X$ be a compact metric space and a kernel
$\mathcal{K} : X^2 \to \mathbb{R}$ be continuous in both arguments. Then for any
sequence $x_1, x_2, \ldots \in X$ and $a > 0$ we have $d_t \to 0$ as $t \to \infty$.

Proof. As discussed in Subsection 5.1, $d_t$ has the meaning of a dispersion under
a certain probabilistic interpretation and therefore $d_t \ge 0$. One can easily
see that $k_{t-1}'(x_t) (K_{t-1} + aI)^{-1} k_{t-1}(x_t) \ge 0$. Indeed, the matrix
$(K_{t-1} + aI)^{-1}$ is positive-definite as the inverse of a positive-definite
matrix. We get

\[ 0 \le k_{t-1}'(x_t) (K_{t-1} + aI)^{-1} k_{t-1}(x_t) \le \mathcal{K}(x_t, x_t) . \tag{3} \]

Let us start by considering a special case. Suppose that the sequence
$x_1, x_2, \ldots$ converges. Let $\lim_{t \to \infty} x_t = x_0 \in X$. The
continuity of $\mathcal{K}$ implies that
$\mathcal{K}(x_t, x_t) \to \mathcal{K}(x_0, x_0)$ and

\[ 0 \le k_{t-1}'(x_t) (K_{t-1} + aI)^{-1} k_{t-1}(x_t) \le \mathcal{K}(x_0, x_0) + o(1) \tag{4} \]

as $t \to \infty$. The positive-semidefiniteness clause in the definition of a
kernel implies that $\mathcal{K}(x_0, x_0) \ge 0$. We will obtain a lower bound on
$k_{t-1}'(x_t) (K_{t-1} + aI)^{-1} k_{t-1}(x_t)$ and show that it must converge to
$\mathcal{K}(x_0, x_0)$, thus proving the lemma in the special case.

For every symmetric positive-semidefinite matrix $M$ and a vector $x$ of the
matching size we have $\lambda_{\min} \|x\|^2 \le x' M x$, where
$\lambda_{\min} \ge 0$ is the smallest eigenvalue of $M$ (this can be shown by
considering the orthonormal base where $M$ diagonalises). The smallest eigenvalue
of $(K_{t-1} + aI)^{-1}$ equals $1/(\lambda + a)$, where $\lambda$ is the largest
eigenvalue of $K_{t-1}$. The value of $\lambda$ is bounded from above by the trace
of $K_{t-1}$:

\[ \lambda \le \sum_{i=1}^{t-1} \mathcal{K}(x_i, x_i) \]

and this yields a lower bound on the smallest eigenvalue of $(K_{t-1} + aI)^{-1}$.

The squared norm of $k_{t-1}(x_t)$ equals

\[ \|k_{t-1}(x_t)\|^2 = \sum_{i=1}^{t-1} (\mathcal{K}(x_i, x_t))^2 . \]

Combining this with the above estimates we get

\[ k_{t-1}'(x_t) (K_{t-1} + aI)^{-1} k_{t-1}(x_t) \ge
 \frac{\sum_{i=1}^{t-1} (\mathcal{K}(x_i, x_t))^2}{a + \sum_{i=1}^{t-1} \mathcal{K}(x_i, x_i)} . \]

Let us assume $\mathcal{K}(x_0, x_0) \ne 0$ and show that the right-hand side of the
inequality tends to $\mathcal{K}(x_0, x_0)$ (if $\mathcal{K}(x_0, x_0) = 0$, then
(4) implies that $k_{t-1}'(x_t) (K_{t-1} + aI)^{-1} k_{t-1}(x_t) \to 0$). Dividing
the numerator and the denominator by $t - 1$ yields

\[ \frac{\sum_{i=1}^{t-1} (\mathcal{K}(x_i, x_t))^2}{a + \sum_{i=1}^{t-1} \mathcal{K}(x_i, x_i)}
 = \frac{\frac{1}{t-1} \sum_{i=1}^{t-1} (\mathcal{K}(x_i, x_t))^2}
 {\frac{a}{t-1} + \frac{1}{t-1} \sum_{i=1}^{t-1} \mathcal{K}(x_i, x_i)} . \]

Clearly, as $t$ goes to infinity, most terms in the sums become arbitrarily close
to $(\mathcal{K}(x_0, x_0))^2$ and $\mathcal{K}(x_0, x_0)$ and thus the 'averages'
tend to $(\mathcal{K}(x_0, x_0))^2$ and $\mathcal{K}(x_0, x_0)$, respectively.
Therefore the fraction tends to $\mathcal{K}(x_0, x_0)$ and (4) implies that
$k_{t-1}'(x_t) (K_{t-1} + aI)^{-1} k_{t-1}(x_t) \to \mathcal{K}(x_0, x_0)$. We have
shown that $d_t \to 0$ for the special case when the sequence $x_1, x_2, \ldots$
converges.

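Let us prove the lemma for an arbitrary sequence $x_1, x_2, \ldots \in X$. If
$d_t \not\to 0$, there is a subsequence of indexes
$\tau_1 < \tau_2 < \ldots < \tau_n < \ldots$ such that $d_{\tau_n}$ is separated
from 0. Since $X$ is compact, we can choose a sub-subsequence
$t_1 < t_2 < \ldots < t_n < \ldots$ such that $d_{t_n}$ is separated from 0 and
$x_{t_n}$ converges. If we show that $d_{t_n} \to 0$, we get a contradiction and
prove the lemma. Thus it is sufficient to show that $d_{t_n} \to 0$ where
$\lim_{n \to \infty} x_{t_n} = x_0 \in X$.

Clearly, we have inequalities (4) for $t = t_n$:

\[ 0 \le k_{t_n-1}'(x_{t_n}) (K_{t_n-1} + aI)^{-1} k_{t_n-1}(x_{t_n}) \le \mathcal{K}(x_0, x_0) + o(1) . \tag{5} \]

We proceed by obtaining a lower bound on the middle term as before.

Fix some $n$ and the corresponding $t_n$. One can rearrange the order of the
elements of the finite sequence $x_1, x_2, \ldots, x_{t_n-1}$ to put the elements
of the subsequence to the front and consider the sequence (still of length
$t_n - 1$) $x_{t_1}, x_{t_2}, \ldots, x_{t_{n-1}}, \bar{x}_1, \bar{x}_2, \ldots, \bar{x}_{t_n-n}$,
where $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_{t_n-n}$ are the elements of the
original sequence with indexes not in the set $\{t_1, t_2, \ldots, t_{n-1}\}$.

One can write

\[ k_{t_n-1}'(x_{t_n}) (K_{t_n-1} + aI)^{-1} k_{t_n-1}(x_{t_n})
 = \bar{k}_{t_n-1}'(x_{t_n}) (\bar{K}_{t_n-1} + aI)^{-1} \bar{k}_{t_n-1}(x_{t_n}) , \]

where

\[ \bar{k}_{t_n-1}(x_{t_n}) = \begin{pmatrix}
\mathcal{K}(x_{t_1}, x_{t_n}) \\ \vdots \\ \mathcal{K}(x_{t_{n-1}}, x_{t_n}) \\
\mathcal{K}(\bar{x}_1, x_{t_n}) \\ \vdots \\ \mathcal{K}(\bar{x}_{t_n-n}, x_{t_n})
\end{pmatrix} \]

and

\[ \bar{K}_{t_n-1} = \begin{pmatrix}
\mathcal{K}(x_{t_1}, x_{t_1}) & \cdots & \mathcal{K}(x_{t_1}, x_{t_{n-1}}) & \mathcal{K}(x_{t_1}, \bar{x}_1) & \cdots & \mathcal{K}(x_{t_1}, \bar{x}_{t_n-n}) \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
\mathcal{K}(x_{t_{n-1}}, x_{t_1}) & \cdots & \mathcal{K}(x_{t_{n-1}}, x_{t_{n-1}}) & \mathcal{K}(x_{t_{n-1}}, \bar{x}_1) & \cdots & \mathcal{K}(x_{t_{n-1}}, \bar{x}_{t_n-n}) \\
\mathcal{K}(\bar{x}_1, x_{t_1}) & \cdots & \mathcal{K}(\bar{x}_1, x_{t_{n-1}}) & \mathcal{K}(\bar{x}_1, \bar{x}_1) & \cdots & \mathcal{K}(\bar{x}_1, \bar{x}_{t_n-n}) \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
\mathcal{K}(\bar{x}_{t_n-n}, x_{t_1}) & \cdots & \mathcal{K}(\bar{x}_{t_n-n}, x_{t_{n-1}}) & \mathcal{K}(\bar{x}_{t_n-n}, \bar{x}_1) & \cdots & \mathcal{K}(\bar{x}_{t_n-n}, \bar{x}_{t_n-n})
\end{pmatrix} . \]

Indeed, we can consider the matrix product
$k_{t_n-1}'(x_{t_n}) (K_{t_n-1} + aI)^{-1} k_{t_n-1}(x_{t_n})$ in the rearranged
orthonormal base where the base vectors with the indexes
$t_1, t_2, \ldots, t_{n-1}$ are at the front of the list. (Alternatively one can
check that rearranging the elements of the training set does not affect ridge
regression prediction and its variance.)

The upper left corner of $\bar{K}_{t_n-1}$ and the upper part of
$\bar{k}_{t_n-1}(x_{t_n})$ consist of values of the kernel on elements of the
subsequence, $\mathcal{K}(x_{t_i}, x_{t_j})$, $i, j = 1, 2, \ldots, n$. We shall use
this observation and reduce the proof to the special case considered above.

Let us single out the left upper corner of size $(n - 1) \times (n - 1)$ in
$\bar{K}_{t_n-1} + aI$ and apply Lemma 8 from Appendix B. The special case
considered above implies that
$\bar{k}_{t_n-1}'(x_{t_n}) (\bar{K}_{t_n-1} + aI)^{-1} \bar{k}_{t_n-1}(x_{t_n}) \ge \mathcal{K}(x_0, x_0) + o(1)$
as $n \to \infty$. Combined with (5) this proves that $d_{t_n} \to 0$ as
$n \to \infty$.

□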
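The contrast between Lemma 1 and the kernel (2) can be seen in a small experiment. The sketch below is illustrative and rests on assumptions: signals drawn uniformly from the compact set $[0, 1]$, a Gaussian kernel as the continuous kernel, and the hypothetical helper name `d_values`.

```python
import numpy as np

def d_values(K, a):
    """Return d_1, ..., d_T computed from the full kernel matrix K of the sequence."""
    ds = []
    for t in range(K.shape[0]):              # zero-based t: t earlier signals
        k = K[:t, t]
        w = np.linalg.solve(K[:t, :t] + a * np.eye(t), k) if t > 0 else np.zeros(0)
        ds.append(K[t, t] - float(k @ w))
    return np.array(ds)

rng = np.random.default_rng(1)
xs = rng.uniform(0.0, 1.0, size=200)                     # signals in the compact set [0, 1]
K_gauss = np.exp(-(xs[:, None] - xs[None, :]) ** 2)      # continuous kernel
K_delta = np.eye(len(xs))                                # kernel (2) on pairwise different signals

d_gauss = d_values(K_gauss, a=1.0)
d_delta = d_values(K_delta, a=1.0)
print(d_gauss[[0, 9, 49, 199]])
print(d_delta[[0, 9, 49, 199]])
```

For the continuous kernel the printed values shrink as more signals are seen, while for the kernel (2) they stay equal to 1.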

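We shall apply this lemma to establish asymptotic equivalence between the
losses in on-line and batch cases. The following corollary generalises Corollary 3
from [ZV09].

Corollary 4. Let $X$ be a compact metric space and a kernel
$\mathcal{K} : X^2 \to \mathbb{R}$ be continuous in both arguments; let
$\mathcal{F}$ be the RKHS corresponding to the kernel $\mathcal{K}$. For a sequence
$(x_1, y_1), (x_2, y_2), \ldots \in X \times \mathbb{R}$ let $\gamma_t^{\mathrm{RR}}$
be the predictions output by on-line ridge regression with a parameter $a > 0$.
Then

1. if there is $f \in \mathcal{F}$ such that
$\sum_{t=1}^{\infty} (y_t - f(x_t))^2 < +\infty$, then

\[ \sum_{t=1}^{\infty} (\gamma_t^{\mathrm{RR}} - y_t)^2 < +\infty ; \]

2. if for all $f \in \mathcal{F}$ we have
$\sum_{t=1}^{\infty} (y_t - f(x_t))^2 = +\infty$, then

\[ \lim_{T \to \infty} \frac{\sum_{t=1}^{T} (\gamma_t^{\mathrm{RR}} - y_t)^2}
 {\min_{f \in \mathcal{F}} \left( \sum_{t=1}^{T} (f(x_t) - y_t)^2 + a \|f\|_{\mathcal{F}}^2 \right)} = 1 . \tag{6} \]

Proof. Part 1 follows from bound (1). Indeed, the continuous function
$\mathcal{K}(x, x)$ is uniformly bounded on $X$ and one can take a finite constant
$c_{\mathcal{F}}$.

Let us prove Part 2. We observed above that $d_t \ge 0$. The identity implies

\[ \sum_{t=1}^{T} (\gamma_t^{\mathrm{RR}} - y_t)^2
 \ge \sum_{t=1}^{T} \frac{(\gamma_t^{\mathrm{RR}} - y_t)^2}{1 + d_t/a}
 = \min_{f \in \mathcal{F}} \left( \sum_{t=1}^{T} (f(x_t) - y_t)^2 + a \|f\|_{\mathcal{F}}^2 \right) \]

and thus the fraction in (6) is always greater than or equal to 1.

Let $m_t = \min_{f \in \mathcal{F}} \left( \sum_{\tau=1}^{t} (y_\tau - f(x_\tau))^2 + a \|f\|_{\mathcal{F}}^2 \right)$.
The sequence $m_t$ is non-decreasing. Indeed, let the minimum in the definition of
$m_t$ be achieved on $f_t$. If $m_{t+1} < m_t$ then one can substitute $f_{t+1}$
into the definition of $m_t$ and decrease the minimum.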

Let us prove that $m_t \to +\infty$ as $t \to \infty$. A monotonic sequence must
have a limit; let $\lim_{t \to \infty} m_t = m_\infty$. We have
$m_1 \le m_2 \le \ldots \le m_\infty$. We will assume that $m_\infty < +\infty$ and
show that there is $f_\infty \in \mathcal{F}$ such that
$\sum_{t=1}^{\infty} (y_t - f_\infty(x_t))^2 \le m_\infty < +\infty$ contrary to the
condition of Part 2.

Proposition 2 implies that $f_t(\cdot)$ is the ridge regression function and
belongs to the linear span of $\mathcal{K}(x_i, \cdot)$, $i = 1, 2, \ldots, t$,
which we will denote by $\mathcal{X}_t$. (For uniformity let
$\mathcal{X}_0 = \{0\}$ and $m_0 = 0$.) The squared norm of $f_t$ does not exceed
$m_t/a \le m_\infty/a$. Thus all $f_t$ belong to the ball of radius
$\sqrt{m_\infty/a}$ centred at the origin.

Let $\mathcal{X}_\infty$ be the closure of the linear span of
$\bigcup_{t=0}^{\infty} \mathcal{X}_t$. If $\mathcal{X}_\infty$ happens to have
a finite dimension, then all $f_t$ belong to a ball in a finite-dimensional space;
this ball is a compact set. If $\mathcal{X}_\infty$ is of infinite dimension, the
ball is not compact but we will construct a different compact set containing all
$f_t$.

Take $0 \le s < t$. The function $f_t$ can be uniquely decomposed as
$f_t = g + h$, where $g$ belongs to $\mathcal{X}_s$ and $h$ is orthogonal to
$\mathcal{X}_s$. The Pythagoras theorem implies that
$\|f_t\|_{\mathcal{F}}^2 = \|g\|_{\mathcal{F}}^2 + \|h\|_{\mathcal{F}}^2$. The
function $h(\cdot)$ is orthogonal to all $\mathcal{K}(x_i, \cdot)$,
$i = 1, 2, \ldots, s$; thus $h(x_i) = \langle h, \mathcal{K}(x_i, \cdot) \rangle = 0$
and $f_t(x_i) = g(x_i)$, $i = 1, 2, \ldots, s$ (recall the proof of the representer
theorem, Proposition 3). Note that $g$ cannot outperform $f_s$, which achieves the
minimum $m_s$. We get

\[ m_t = \sum_{i=1}^{t} (y_i - f_t(x_i))^2 + a \|f_t\|_{\mathcal{F}}^2 \]
\[ = \sum_{i=1}^{s} (y_i - g(x_i))^2 + a \|g\|_{\mathcal{F}}^2
 + \sum_{i=s+1}^{t} (y_i - f_t(x_i))^2 + a \|h\|_{\mathcal{F}}^2 \]
\[ \ge m_s + a \|h\|_{\mathcal{F}}^2 . \]

This inequality implies that
$\|h\|_{\mathcal{F}}^2 \le (m_t - m_s)/a \le (m_\infty - m_s)/a$.

Consider the set $B$ of functions
$f \in \mathcal{X}_\infty \subseteq \mathcal{F}$ satisfying the following property
for every $s = 0, 1, 2, \ldots$: let $f = g + h$ be the unique decomposition such
that $g \in \mathcal{X}_s$ and $h$ is orthogonal to $\mathcal{X}_s$; then the norm
of $h$ satisfies

\[ \|h\|_{\mathcal{F}}^2 \le (m_\infty - m_s)/a . \]

We have shown that all $f_t$ belong to $B$, $t = 1, 2, \ldots$ Let us show that
$B$ is compact. It is closed because the projection in Hilbert spaces is a
continuous operator. Let us show that $B$ is totally bounded. We shall fix
$\varepsilon > 0$ and construct a finite $\varepsilon$-net of points such that $B$
is covered by closed balls of radius $\varepsilon$ centred at the points from the
net.

There is $s > 0$ such that $(m_\infty - m_s)/a \le \varepsilon^2/2$ because
$m_s \to m_\infty$. The ball of radius $\sqrt{(m_\infty - m_0)/a}$ in
$\mathcal{X}_s$ is compact and therefore it contains a finite
$\varepsilon/\sqrt{2}$-net $g_1, g_2, \ldots, g_k$. Every $f \in B$ can be
represented as $f = g + h$, where $g$ belongs to $\mathcal{X}_s$ and $h$ is
orthogonal to $\mathcal{X}_s$. Since
$\|g\|_{\mathcal{F}} \le \|f\|_{\mathcal{F}} \le \sqrt{(m_\infty - m_0)/a}$, the
function $g$ belongs to the ball of radius $\sqrt{(m_\infty - m_0)/a}$ in
$\mathcal{X}_s$ and therefore
$\|g - g_i\|_{\mathcal{F}} \le \varepsilon/\sqrt{2}$ for some $g_i$ from the
$\varepsilon/\sqrt{2}$-net. The definition of $B$ implies that
$\|h\|_{\mathcal{F}}^2 \le (m_\infty - m_s)/a \le \varepsilon^2/2$. The Pythagoras
theorem yields

\[ \|f - g_i\|_{\mathcal{F}}^2 = \|g - g_i\|_{\mathcal{F}}^2 + \|h\|_{\mathcal{F}}^2
 \le \frac{\varepsilon^2}{2} + \frac{\varepsilon^2}{2} = \varepsilon^2 . \]

Thus the net we have constructed is an $\varepsilon$-net for $B$.

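Since the functions $f_t$ belong to a compact set, there is a converging
subsequence $f_{t_k}$; let $\lim_{k \to \infty} f_{t_k} = f_\infty$. We have
$\sum_{t=1}^{\infty} (y_t - f_\infty(x_t))^2 + a \|f_\infty\|_{\mathcal{F}}^2 \le m_\infty$.
Indeed, if
$\sum_{t=1}^{\infty} (y_t - f_\infty(x_t))^2 + a \|f_\infty\|_{\mathcal{F}}^2 > m_\infty$
then for a sufficiently large $T_0$ we have
$\sum_{t=1}^{T_0} (y_t - f_\infty(x_t))^2 + a \|f_\infty\|_{\mathcal{F}}^2 > m_\infty$.
Since $f_{t_k} \to f_\infty$ in $\mathcal{F}$ we get
$f_{t_k}(x) \to f_\infty(x)$ for all $x \in X$ and for sufficiently large $k$ all
$f_{t_k}(x_t)$ are sufficiently close to $f_\infty(x_t)$, $t = 1, 2, \ldots, T_0$,
and $\|f_{t_k}\|_{\mathcal{F}}$ is sufficiently close to
$\|f_\infty\|_{\mathcal{F}}$ so that
$\sum_{t=1}^{T_0} (y_t - f_{t_k}(x_t))^2 + a \|f_{t_k}\|_{\mathcal{F}}^2 > m_\infty$.
This is impossible because for $t_k \ge T_0$ the left-hand side does not exceed
$m_{t_k} \le m_\infty$. We have proved that under the conditions of Part 2 we have
$m_t \to +\infty$ as $t \to \infty$.

Take $\varepsilon > 0$. Since by Lemma 1 we have $d_T \to 0$, there is $T_0$ such
that for all $T \ge T_0$ we have $1 + d_T/a \le 1 + \varepsilon$ and

\[ \sum_{t=1}^{T} (y_t - \gamma_t^{\mathrm{RR}})^2
 = \sum_{t=1}^{T_0} (y_t - \gamma_t^{\mathrm{RR}})^2
 + \sum_{t=T_0+1}^{T} (y_t - \gamma_t^{\mathrm{RR}})^2 \]
\[ \le \sum_{t=1}^{T_0} (y_t - \gamma_t^{\mathrm{RR}})^2
 + (1 + \varepsilon) \min_{f \in \mathcal{F}} \left( \sum_{t=1}^{T} (y_t - f(x_t))^2 + a \|f\|_{\mathcal{F}}^2 \right) . \]

Therefore for all sufficiently large $T$ the fraction in (6) does not exceed
$1 + 2\varepsilon$. □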

Remark 1. The proof of compactness above is based on the following general
result (cf. [BP70], Chapter 4, exercise 7 on p. 172). Let $B$ be a subset of
$l_2$. Then $B$ is totally bounded if and only if there is a sequence of
nonnegative numbers $\alpha_1, \alpha_2, \ldots \ge 0$ converging to 0, i.e.,
$\lim_{t \to \infty} \alpha_t = 0$, such that for every
$x = (x_1, x_2, \ldots) \in B$ and every $t = 1, 2, \ldots$ the inequality
$\sum_{i=t}^{\infty} x_i^2 \le \alpha_t$ holds.
This result generalises the well-known construction of the Hilbert cube (also
known as the Hilbert brick).

The corollary does not hold for a non-compact domain. Let us construct a
counterexample.

Let $X$ be the unit ball in $l_2$, i.e., $X = \{x \in l_2 \mid \|x\|_{l_2} \le 1\}$.
Let the kernel on $X$ be the scalar product in $l_2$, i.e., for
$u = (u_1, u_2, \ldots)$ and $v = (v_1, v_2, \ldots)$ from $X$ we have
$\mathcal{K}(u, v) = \langle u, v \rangle_{l_2} = \sum_{i=1}^{\infty} u_i v_i$.

Consider the following sequence of elements $x_t \in X$. Let
$x_{2i-1} = x_{2i}$ have one at position $i$ and zeroes elsewhere,
$i = 1, 2, \ldots$ Consider the sequence of outcomes where odd elements equal 1 and
even elements equal 0, i.e., $y_{2i-1} = 1$ and $y_{2i} = 0$ for
$i = 1, 2, \ldots$ We get

\[ \begin{array}{ccc}
t & x_t & y_t \\
1 & (1, 0, 0, \ldots) & 1 \\
2 & (1, 0, 0, \ldots) & 0 \\
3 & (0, 1, 0, \ldots) & 1 \\
4 & (0, 1, 0, \ldots) & 0 \\
\vdots & \vdots & \vdots
\end{array} \]

Fix $a > 0$. Let us work out the predictions $\gamma_t^{\mathrm{RR}}$ output by
on-line ridge regression on this sequence. The definition implies that
$\gamma_1^{\mathrm{RR}} = 0$ and $\gamma_2^{\mathrm{RR}} = 1/(1 + a)$. To obtain
further predictions we need the following lemma stating that examples with signals
orthogonal to all other signals and to $x_0$, where we want to obtain a prediction,
can be dropped from the sample.

Lemma 2. Let $\mathcal{K} : X \times X \to \mathbb{R}$ be a kernel on a domain $X$;
let $S = ((x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)) \in (X \times \mathbb{R})^*$ be
a sample of pairs and let $x_0 \in X$. If there is a subset
$(x_{i_1}, y_{i_1}), (x_{i_2}, y_{i_2}), \ldots, (x_{i_k}, y_{i_k})$ of $S$ such that
the signals of the examples from this subset are orthogonal w.r.t. $\mathcal{K}$ to
all other signals, i.e., $\mathcal{K}(x_{i_j}, x_m) = 0$ for all
$j = 1, 2, \ldots, k$ and $m \ne i_1, i_2, \ldots, i_k$, and orthogonal to $x_0$
w.r.t. $\mathcal{K}$, i.e., $\mathcal{K}(x_{i_j}, x_0) = 0$ for all
$j = 1, 2, \ldots, k$, then all elements of this subset can be removed from the
sample $S$ without affecting the ridge regression prediction
$f_{\mathrm{RR}}(x_0)$.

Proof. Let the subset coincide with the whole of $S$. Then $k(x_0) = 0$ and the
ridge regression formula implies that ridge regression outputs $\gamma = 0$.
Dropping the whole sample $S$ leads to the same prediction by definition. For the
rest of the proof assume that the subset is proper.

The main part of the proof relies on the optimality of ridge regression given
by Proposition 2. Let $\mathcal{F}$ be the RKHS of functions on $X$ corresponding to
$\mathcal{K}$. The ridge regression function for the sample $S$ minimises
$\sum_{t=1}^{T} (f(x_t) - y_t)^2 + a \|f\|_{\mathcal{F}}^2$ and by the representer
theorem (Proposition 3) it is a linear combination of $\mathcal{K}(x_i, \cdot)$,
$i = 1, 2, \ldots, T$.

Let us represent a linear combination $f$ as $f_1 + f_2$, where $f_1$ is a linear
combination of $\mathcal{K}(x_{i_j}, \cdot)$, $j = 1, 2, \ldots, k$, corresponding
to signals from the subset and $f_2$ is a linear combination of the functions
corresponding to the remaining signals. The functions $f_1$ and $f_2$ are
orthogonal in $\mathcal{F}$ and this representation is unique. For every
$j = 1, 2, \ldots, k$ and $m \ne i_1, i_2, \ldots, i_k$ we have
$f(x_{i_j}) = f_1(x_{i_j})$ and $f(x_m) = f_2(x_m)$ and therefore

\[ \sum_{t=1}^{T} (f(x_t) - y_t)^2 + a \|f\|_{\mathcal{F}}^2
 = \sum_{j=1}^{k} (f_1(x_{i_j}) - y_{i_j})^2
 + \sum_{m \ne i_1, i_2, \ldots, i_k} (f_2(x_m) - y_m)^2
 + a \|f_1\|_{\mathcal{F}}^2 + a \|f_2\|_{\mathcal{F}}^2 . \]

This expression splits into two terms depending only on $f_1$ and $f_2$. We can
minimise it independently over $f_1$ and $f_2$. Note that $f_1(x_0) = 0$ by
assumption and therefore $f_{\mathrm{RR}}(x_0) = \tilde{f}_2(x_0)$, where
$\tilde{f}_2$ minimises

\[ \sum_{m \ne i_1, i_2, \ldots, i_k} (f(x_m) - y_m)^2 + a \|f\|_{\mathcal{F}}^2 \]

over $\mathcal{F}$. The optimality property implies that $\tilde{f}_2$ is the ridge
regression function for the smaller sample. □
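Lemma 2 is easy to check numerically for the linear kernel. The sketch below is an illustration under assumptions (the function name `ridge_predict_linear` and the hand-picked vectors are hypothetical): the last two signals are orthogonal both to the remaining signals and to the point where the prediction is made, so removing them should not change the prediction.

```python
import numpy as np

def ridge_predict_linear(X, y, x0, a):
    """Ridge regression prediction at x0 for the linear kernel; rows of X are signals."""
    K = X @ X.T                       # kernel matrix
    k = X @ x0                        # vector k(x0)
    return float(y @ np.linalg.solve(K + a * np.eye(len(y)), k))

X = np.array([[1.0, 0.0, 0.0],
              [2.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],       # orthogonal to the first two signals and to x0
              [0.0, 3.0, 0.0]])      # likewise
y = np.array([1.0, 0.5, -2.0, 4.0])
x0 = np.array([1.5, 0.0, 0.0])

full = ridge_predict_linear(X, y, x0, a=0.7)
reduced = ridge_predict_linear(X[:2], y[:2], x0, a=0.7)   # orthogonal examples dropped
print(full, reduced)
```

Note that the removed signals need not be orthogonal to each other, only to the rest of the sample and to the prediction point.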

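The lemma implies that $\gamma_{2i-1}^{\mathrm{RR}} = \gamma_1^{\mathrm{RR}} = 0$
and $\gamma_{2i}^{\mathrm{RR}} = \gamma_2^{\mathrm{RR}} = 1/(1 + a)$ for all
$i = 1, 2, \ldots$ It is easy to see that Corollary 4 is violated. For $f_0 = 0$ we
have

\[ \frac{\sum_{t=1}^{2n} (\gamma_t^{\mathrm{RR}} - y_t)^2}
 {\sum_{t=1}^{2n} (f_0(x_t) - y_t)^2 + a \|f_0\|_{\mathcal{F}}^2}
 = \frac{n (1 + 1/(1 + a)^2)}{n} = 1 + \frac{1}{(1 + a)^2} > 1 . \]

The actual minimiser³ gives an even smaller denominator and an even larger
fraction.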

We have shown that compactness is necessary in the corollary. It is easy to
modify the counterexample to show that compactness without the continuity of
$\mathcal{K}$ is not sufficient. Indeed, take an arbitrary compact metric space $X$ containing
an infinite sequence $\tilde{x}_0, \tilde{x}_1, \tilde{x}_2, \ldots$ where $\tilde{x}_i \ne \tilde{x}_j$ for $i \ne j$. Let $\Phi : X \to l_2$ be
such that $\tilde{x}_i$ is mapped to $x_i$ from the counterexample for every $i = 1, 2, \ldots$ and
$\tilde{x}_0$ is mapped to $0$. Define the kernel $\mathcal{K}$ on $X^2$ by $\mathcal{K}(u, v) = \langle \Phi(u), \Phi(v) \rangle_{l_2}$ (this
kernel cannot be continuous). Take $y_i$ as in the counterexample. All predictions
and losses on $(\tilde{x}_1, y_1), (\tilde{x}_2, y_2), \ldots$ will be as in the counterexample with $\mathcal{K}(\tilde{x}_0, \cdot)$
playing the part of $f_0$.

5 Gaussian Fields and a Proof of the Identity 

We will prove the identity by means of a probabilistic interpretation of ridge 
regression. 

†It can easily be calculated but we do not really need it.

5.1 Probabilistic Interpretation

Suppose that we have a Gaussian random field‡ $z_x$ with the means of $0$ and
the covariances $\mathrm{cov}(z_{x_1}, z_{x_2}) = \mathcal{K}(x_1, x_2)$. Such a field exists. Indeed, for any
finite set of $x_1, x_2, \ldots, x_T$ our requirements imply the Gaussian distribution
with the mean of $0$ and the covariance matrix of $K$. These distributions satisfy
the consistency requirements and thus the Kolmogorov extension (or existence)
theorem (see, e.g., [Lam77], Appendix 1 for a proof sketch§) can be applied to
construct a field over $X$.

Let $\varepsilon_x$ be a Gaussian field of mutually independent and independent of $z_x$
random values with the mean of $0$ and variance $\sigma^2$. The existence of such a
field can be shown using the same Kolmogorov theorem. Now let $y_x = z_x + \varepsilon_x$.
Intuitively, $\varepsilon_x$ can be thought of as random noise introduced by measurements of
the original field $z_x$. The field $z_x$ is not observable directly and we can possibly
obtain only the values of $y_x$.

The learning process can be thought of as estimating the values of the field $y_x$
given the values of the field at sample points. One can show that the conditional
distribution of $z_x$ given a sample $S = ((x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T))$ is Gaussian
with the mean of $\gamma_x^{\mathrm{RR}} = Y'(K + \sigma^2 I)^{-1} k(x)$ and the variance
$d_x = \mathcal{K}(x, x) - k'(x)(K + \sigma^2 I)^{-1} k(x)$. The conditional distribution of $y_x$ is Gaussian with the
same mean and the variance $\sigma^2 + \mathcal{K}(x, x) - k'(x)(K + \sigma^2 I)^{-1} k(x)$ (see [RW06],
Section 2.2, p. 17).

If we let $a = \sigma^2$, we see that $\gamma_t^{\mathrm{RR}}$ and $a + d_t$ are, respectively, the mean and
the variance of the conditional distributions for $y_{x_t}$ given the sample $S_t$.

Remark 2. Note that in the statement of the theorem there is no assumption
that the signals $x_t$ are pairwise different. Some of them may coincide. In
the probabilistic picture all $x$s must be different though, or the corresponding
probabilities make no sense. This obstacle may be overcome in the following
way. Let us replace the domain $X$ by $X' = X \times \mathbb{N}$, where $\mathbb{N}$ is the set of
positive integers $\{1, 2, \ldots\}$, and replace $x_t$ by $x'_t = (x_t, t) \in X'$. For $X'$ there is
a Gaussian field with the covariance function $\mathcal{K}'((x_1, t_1), (x_2, t_2)) = \mathcal{K}(x_1, x_2)$.
The argument concerning the probabilistic meaning of ridge regression stays for
$\mathcal{K}'$ on $X'$. We can thus assume that all $x_t$ are different.

The proof of the identity is based on the Gaussian field interpretation. Let us
calculate the density of the joint distribution of the variables $(y_{x_1}, y_{x_2}, \ldots, y_{x_T})$
at the point $(y_1, y_2, \ldots, y_T)$. We will do this in three different ways: by
decomposing the density into a chain of conditional densities, marginalisation, and,
finally, direct calculation. Each method will give us a different expression
corresponding to a term in the identity. Since all the three terms express the same
density, they must be equal.

‡We use the term 'field' rather than 'process' to emphasise the fact that $X$ is not necessarily
a subset of $\mathbb{R}$ and its elements do not have to be moments of time; some textbooks still use
the word 'process' in this case.

§Strictly speaking, we do not need to construct the field for the whole $X$ in order
to prove the theorem; it suffices to consider a finite-dimensional Gaussian distribution of

5.2 Conditional Probabilities

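We have

$$p_{y_{x_1}, y_{x_2}, \ldots, y_{x_T}}(y_1, y_2, \ldots, y_T) = p_{y_{x_T}}(y_T \mid y_{x_1} = y_1, y_{x_2} = y_2, \ldots, y_{x_{T-1}} = y_{T-1}) \cdot p_{y_{x_1}, y_{x_2}, \ldots, y_{x_{T-1}}}(y_1, y_2, \ldots, y_{T-1}) .$$

Expanding this further yields

$$p_{y_{x_1}, y_{x_2}, \ldots, y_{x_T}}(y_1, y_2, \ldots, y_T) = p_{y_{x_T}}(y_T \mid y_{x_1} = y_1, \ldots, y_{x_{T-1}} = y_{T-1}) \cdot p_{y_{x_{T-1}}}(y_{T-1} \mid y_{x_1} = y_1, \ldots, y_{x_{T-2}} = y_{T-2}) \cdots p_{y_{x_1}}(y_1) .$$

As we have seen before, the distribution for $y_{x_t}$ given that $y_{x_1} = y_1, y_{x_2} = y_2, \ldots, y_{x_{t-1}} = y_{t-1}$
is Gaussian with the mean of $\gamma_t^{\mathrm{RR}}$ and the variance of $d_t + \sigma^2$. Thus

$$p_{y_{x_t}}(y_t \mid y_{x_1} = y_1, y_{x_2} = y_2, \ldots, y_{x_{t-1}} = y_{t-1}) = \frac{1}{\sqrt{2\pi} \sqrt{d_t + \sigma^2}} \, e^{-\frac{1}{2} \frac{(y_t - \gamma_t^{\mathrm{RR}})^2}{d_t + \sigma^2}}$$

and

$$p_{y_{x_1}, y_{x_2}, \ldots, y_{x_T}}(y_1, y_2, \ldots, y_T) = \frac{1}{(2\pi)^{T/2} \sqrt{(d_1 + \sigma^2)(d_2 + \sigma^2) \cdots (d_T + \sigma^2)}} \, e^{-\frac{1}{2} \sum_{t=1}^T \frac{(\gamma_t^{\mathrm{RR}} - y_t)^2}{d_t + \sigma^2}} .$$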

5.3 Dealing with a Singular Kernel Matrix 

The expression for the second case looks particularly simple for non-singular K. 
Let us show that this is sufficient to prove the identity. 

All the terms in the identity are in fact continuous functions of $T(T + 1)/2$
values of $\mathcal{K}$ at the pairs of points $x_i, x_j$, $i, j = 1, 2, \ldots, T$. Indeed, the values
of $\gamma_t^{\mathrm{RR}}$ in the left-hand side expression are ridge regression predictions given by
the respective analytic formula. Note that the coefficients of the inverse matrix are
continuous functions of the original matrix.

The optimal function minimising the second expression is in fact
$f_{\mathrm{RR}}(x) = \sum_{t=1}^T c_t \mathcal{K}(x_t, x)$, where the coefficients $c_t$ are continuous functions of the values
of $\mathcal{K}$. The reproducing property implies that

$$\|f_{\mathrm{RR}}\|_{\mathcal{F}}^2 = \sum_{i,j=1}^T c_i c_j \langle \mathcal{K}(x_i, \cdot), \mathcal{K}(x_j, \cdot) \rangle_{\mathcal{F}} = \sum_{i,j=1}^T c_i c_j \mathcal{K}(x_i, x_j) .$$

We can thus conclude that all the expressions are continuous in the values
of $\mathcal{K}$. Consider the kernel $\mathcal{K}_\alpha(x_1, x_2) = \mathcal{K}(x_1, x_2) + \alpha \delta(x_1, x_2)$, where $\delta$ is as
defined earlier and $\alpha > 0$. Clearly, $\delta$ is a kernel and thus $\mathcal{K}_\alpha$ is a kernel. If all $x_t$ are
different (recall Remark 2), the kernel matrix for $\mathcal{K}_\alpha$ equals $K + \alpha I$ and therefore
it is non-singular.

However, the values of $\mathcal{K}_\alpha$ tend to the corresponding values of $\mathcal{K}$ as $\alpha \to 0$.

5.4 Marginalisation

The method of marginalisation consists of introducing extra variables to
obtain the joint density in some manageable form and then integrating over the
extra variables to get rid of them. The variables we are going to consider are
$z_{x_1}, z_{x_2}, \ldots, z_{x_T}$.

Given the values of $z_{x_1}, z_{x_2}, \ldots, z_{x_T}$, the density of $y_{x_1}, y_{x_2}, \ldots, y_{x_T}$ is easy
to calculate. Indeed, given $z$s all $y$s are independent and have the means of
corresponding $z$s and variances of $\sigma^2$, i.e.,

$$p_{y_{x_1}, y_{x_2}, \ldots, y_{x_T}}(y_1, y_2, \ldots, y_T \mid z_{x_1} = z_1, z_{x_2} = z_2, \ldots, z_{x_T} = z_T) = \prod_{t=1}^T \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{1}{2} \frac{(y_t - z_t)^2}{\sigma^2}} = \frac{1}{(2\pi)^{T/2} \sigma^T} \, e^{-\frac{1}{2\sigma^2} \sum_{t=1}^T (y_t - z_t)^2} .$$



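Since $z_{x_1}, z_{x_2}, \ldots, z_{x_T}$ have a joint Gaussian distribution with the mean of $0$ and
covariance matrix $K_T$, their density is given by

$$p_{z_{x_1}, z_{x_2}, \ldots, z_{x_T}}(z_1, z_2, \ldots, z_T) = \frac{1}{(2\pi)^{T/2} \sqrt{\det K_T}} \, e^{-\frac{1}{2} Z' K_T^{-1} Z} ,$$

where $Z = (z_1, z_2, \ldots, z_T)'$, provided $K_T$ is non-singular.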
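Using

$$p_{y_{x_1}, \ldots, y_{x_T}, z_{x_1}, \ldots, z_{x_T}}(y_1, \ldots, y_T, z_1, \ldots, z_T) = p_{y_{x_1}, \ldots, y_{x_T}}(y_1, \ldots, y_T \mid z_{x_1} = z_1, \ldots, z_{x_T} = z_T) \cdot p_{z_{x_1}, \ldots, z_{x_T}}(z_1, \ldots, z_T)$$

and

$$p_{y_{x_1}, \ldots, y_{x_T}}(y_1, \ldots, y_T) = \int_{\mathbb{R}^T} p_{y_{x_1}, \ldots, y_{x_T}, z_{x_1}, \ldots, z_{x_T}}(y_1, \ldots, y_T, z_1, \ldots, z_T) \, dZ$$

we get

$$p_{y_{x_1}, \ldots, y_{x_T}}(y_1, \ldots, y_T) = \frac{1}{(2\pi)^{T/2} \sigma^T} \cdot \frac{1}{(2\pi)^{T/2} \sqrt{\det K_T}} \int_{\mathbb{R}^T} e^{-\frac{1}{2\sigma^2} \sum_{t=1}^T (y_t - z_t)^2 - \frac{1}{2} Z' K_T^{-1} Z} \, dZ .$$

To evaluate the integral we need the following proposition (see [BB61],
Theorem 3 of Chapter 2).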

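Proposition 1. Let $Q(\theta)$ be a quadratic form of $\theta \in \mathbb{R}^n$ with the positive-definite
quadratic part, i.e., $Q(\theta) = \theta' A \theta + \theta' b + c$, where the matrix $A$ is
symmetric positive-definite. Then

$$\int_{\mathbb{R}^n} e^{-Q(\theta)} \, d\theta = e^{-Q(\theta_0)} \frac{\pi^{n/2}}{\sqrt{\det A}} ,$$

where $\theta_0 = \arg\min_{\mathbb{R}^n} Q$.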

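The quadratic part of the form in our integral has the matrix $\frac{1}{2} K_T^{-1} + \frac{1}{2\sigma^2} I$
and therefore

$$p_{y_{x_1}, \ldots, y_{x_T}}(y_1, \ldots, y_T) = \frac{\pi^{T/2}}{(2\pi)^T \sigma^T \sqrt{\det K_T} \sqrt{\det\left(\frac{1}{2} K_T^{-1} + \frac{1}{2\sigma^2} I\right)}} \, e^{-\min_Z \left( \frac{1}{2\sigma^2} \sum_{t=1}^T (y_t - z_t)^2 + \frac{1}{2} Z' K_T^{-1} Z \right)} .$$

We have

$$\sqrt{\det K_T} \sqrt{\det\left(\frac{1}{2} K_T^{-1} + \frac{1}{2\sigma^2} I\right)} = \sqrt{\det\left(\frac{1}{2} I + \frac{1}{2\sigma^2} K_T\right)} .$$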



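Let us deal with the minimum. We will link it to

$$M = \min_{f \in \mathcal{F}} \left( \sum_{t=1}^T (f(x_t) - y_t)^2 + \sigma^2 \|f\|_{\mathcal{F}}^2 \right) .$$

The representer theorem (see Proposition 3) implies that the minimum from the
definition of $M$ is achieved on a function of the form $f(x) = \sum_{t=1}^T c_t \mathcal{K}(x_t, x)$.
For the column vector $Z = (f(x_1), f(x_2), \ldots, f(x_T))'$ we have $Z = K_T C$,
where $C = (c_1, c_2, \ldots, c_T)'$. Since $K_T$ is supposed to be non-singular, there is
a one-to-one correspondence between $C$ and $Z$; we have $C = K_T^{-1} Z$ and
$\|f\|_{\mathcal{F}}^2 = C' K_T C = Z' K_T^{-1} Z$. We can minimise by $Z$ instead of $C$ and
therefore

$$\min_Z \left( \frac{1}{2\sigma^2} \sum_{t=1}^T (y_t - z_t)^2 + \frac{1}{2} Z' K_T^{-1} Z \right) = \frac{1}{2\sigma^2} \min_{f \in \mathcal{F}} \left( \sum_{t=1}^T (f(x_t) - y_t)^2 + \sigma^2 \|f\|_{\mathcal{F}}^2 \right) = \frac{M}{2\sigma^2} .$$

For the density we get the expression

$$p_{y_{x_1}, \ldots, y_{x_T}}(y_1, \ldots, y_T) = \frac{1}{(2\pi)^{T/2} \sqrt{\det(K_T + \sigma^2 I)}} \, e^{-\frac{M}{2\sigma^2}} .$$

5.5 Direct Calculation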

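One can easily calculate the covariances of $y$s:

$$\mathrm{cov}(y_{x_1}, y_{x_2}) = E(z_{x_1} + \varepsilon_{x_1})(z_{x_2} + \varepsilon_{x_2}) = \mathcal{K}(x_1, x_2) + \sigma^2 \delta(x_1, x_2) .$$

Therefore, one can write down the expression

$$p_{y_{x_1}, \ldots, y_{x_T}}(y_1, \ldots, y_T) = \frac{1}{(2\pi)^{T/2} \sqrt{\det(K_T + \sigma^2 I)}} \, e^{-\frac{1}{2} Y_T' (K_T + \sigma^2 I)^{-1} Y_T} .$$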

5.6 Equating the Terms 

It remains to take the logarithms of the densities calculated in different ways. 
We need the following matrix lemma. 

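Lemma 3.

$$(d_1 + \sigma^2)(d_2 + \sigma^2) \cdots (d_T + \sigma^2) = \det(K_T + \sigma^2 I) .$$

Proof. The lemma follows from Frobenius's identity (see, e.g., [HS81]):

$$\det \begin{pmatrix} A & u \\ v' & d \end{pmatrix} = (d - v' A^{-1} u) \det A ,$$

where $d$ is a scalar and the submatrix $A$ is non-singular.
We have

$$\begin{aligned}
\det(K_T + \sigma^2 I) &= \left( \mathcal{K}(x_T, x_T) + \sigma^2 - k'_{T-1}(x_T)(K_{T-1} + \sigma^2 I)^{-1} k_{T-1}(x_T) \right) \det(K_{T-1} + \sigma^2 I) \\
&= (d_T + \sigma^2) \det(K_{T-1} + \sigma^2 I) \\
&= \ldots \\
&= (d_T + \sigma^2)(d_{T-1} + \sigma^2) \cdots (d_2 + \sigma^2)(d_1 + \sigma^2) .
\end{aligned}$$

□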

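We get

$$\sum_{t=1}^T \frac{(\gamma_t^{\mathrm{RR}} - y_t)^2}{d_t + \sigma^2} = \frac{M}{\sigma^2} = Y_T' (K_T + \sigma^2 I)^{-1} Y_T .$$

Multiplying through by $\sigma^2 = a$, the theorem follows.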
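The identity can also be checked numerically. The following is a minimal sketch in exact rational arithmetic; the kernel $(1 + uv)^2$ on scalars and the helper names `solve`, `kernel`, `regularised_matrix` and `identity_sides` are our own illustrative assumptions and are not taken from the paper. It compares the left-hand side $\sum_t (\gamma_t^{\mathrm{RR}} - y_t)^2 / (1 + d_t/a)$ with the right-hand side $a Y_T'(K_T + aI)^{-1} Y_T$.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A z = b exactly by Gauss-Jordan elimination (A square, non-singular)."""
    n = len(A)
    M = [[Fraction(v) for v in A[i]] + [Fraction(b[i])] for i in range(n)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)  # row with a non-zero pivot
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]


def kernel(u, v):
    """Illustrative kernel on scalars (an assumption): K(u, v) = (1 + uv)^2."""
    return (1 + u * v) ** 2


def regularised_matrix(xs, a):
    """Return the matrix K + aI for the signals xs."""
    return [[kernel(u, v) + (a if i == j else 0) for j, v in enumerate(xs)]
            for i, u in enumerate(xs)]


def identity_sides(xs, ys, a):
    """Return (left-hand side, right-hand side) of the identity as exact fractions."""
    a = Fraction(a)
    lhs = Fraction(0)
    for t, (x, y) in enumerate(zip(xs, ys)):
        past = xs[:t]
        k = [kernel(u, x) for u in past]               # k_{t-1}(x_t)
        w = solve(regularised_matrix(past, a), k)      # (K_{t-1} + aI)^{-1} k_{t-1}(x_t)
        gamma = sum(wi * yi for wi, yi in zip(w, ys))  # on-line prediction on step t
        d = kernel(x, x) - sum(wi * ki for wi, ki in zip(w, k))
        lhs += (gamma - y) ** 2 / (1 + d / a)
    rhs = a * sum(yi * vi for yi, vi in zip(ys, solve(regularised_matrix(xs, a), ys)))
    return lhs, rhs
```

Calling `identity_sides([1, -2, 1, Fraction(3, 2)], [1, 0, 2, -1], Fraction(1, 2))` returns two equal fractions; note that the first and the third signals coincide, which the theorem allows.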
6 Bayesian Merging Algorithm and an Alternative Proof of the Identity



In this section we reproduce an alternative way (after [ZV09]) of obtaining the
identity.

An advantage of this approach is that we do not need to consider random 
fields. The use of probability is minimal; all probabilities in this approach are 
no more than weights or predictions. This provides an additional intuition to 
the proof. 

6.1 Prediction with Expert Advice 

Consider the standard prediction with expert advice framework. Let outcomes
$y_1, y_2, \ldots$ from an outcome set $\Omega$ occur successively in discrete time. A learner
tries to predict each outcome and outputs a prediction $\gamma_t$ from a prediction set $\Gamma$
each time before it sees the outcome $y_t$. There is also a pool $\Theta$ of experts; experts
try to predict the outcomes from the same sequence and their predictions are
made available to the learner. The quality of predictions is assessed by means
of a loss function $\lambda : \Gamma \times \Omega \to [0, +\infty]$.

The framework can be summarised in the following protocol: 

Protocol 2.

for $t = 1, 2, \ldots$
    experts $\theta \in \Theta$ announce predictions $\gamma_t^\theta \in \Gamma$
    learner outputs $\gamma_t \in \Gamma$
    reality announces $y_t \in \Omega$
    each expert $\theta \in \Theta$ suffers loss $\lambda(\gamma_t^\theta, y_t)$
    learner suffers loss $\lambda(\gamma_t, y_t)$
endfor

The goal of the learner in this framework is to suffer the cumulative loss
$\mathrm{Loss}_T = \sum_{t=1}^T \lambda(\gamma_t, y_t)$ not much larger than the cumulative loss of each expert,
$\mathrm{Loss}_T(\theta) = \sum_{t=1}^T \lambda(\gamma_t^\theta, y_t)$.

In this paper we consider the game with the outcome set $\Omega = \mathbb{R}$ and the
prediction set $\Gamma$ of all continuous density functions on $\mathbb{R}$, i.e., continuous
functions $\xi : \mathbb{R} \to [0, +\infty)$ such that $\int_{-\infty}^{+\infty} \xi(y) \, dy = 1$. The loss function is negative
logarithmic likelihood, i.e., $\lambda(\xi, y) = -\ln \xi(y)$.

6.2 Bayesian Merging Algorithm 

Consider the following merging algorithm for the learner. The algorithm takes
an initial distribution $P_0$ on the pool of experts $\Theta$ as a parameter and maintains
weights $P_t$ for experts $\theta$.

Protocol 3.

let $P_0^* = P_0$
for $t = 1, 2, \ldots$
    read experts' predictions $\xi_t^\theta \in \Gamma$, $\theta \in \Theta$
    predict $\xi_t = \int_\Theta \xi_t^\theta P_{t-1}^*(d\theta)$
    read $y_t$
    update the weights $P_t(d\theta) = \xi_t^\theta(y_t) P_{t-1}(d\theta)$
    normalise the weights $P_t^*(d\theta) = P_t(d\theta) / \int_\Theta P_t(d\theta)$
endfor

If we consider an expert as a probabilistic hypothesis, this algorithm
becomes the Bayesian strategy for merging hypotheses. The weights $P_t^*$ relate to
$P_{t-1}^*$ as posterior probabilities to prior probabilities assigned to the hypotheses.
We will refer to the algorithm as the Bayesian Algorithm (BA).

The algorithm can also be considered as a special case of the Aggregating
Algorithm ([Vov90, Vov98], see also [Vov01]) going back to [DMW88]. It is easy
to check that the Aggregating Algorithm for this outcome set, prediction set,
and loss function and the learning rate $\eta = 1$ reduces to Protocol 3. However,
we will not be using the results proved for the Aggregating Algorithm in this
paper.

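After $t$ steps the weights become

$$P_t(d\theta) = e^{-\mathrm{Loss}_t(\theta)} P_0(d\theta) . \qquad (7)$$

The following lemma is a special case of Lemma 1 in Vovk [Vov01]. It shows
that the cumulative loss of the BA is an average of the experts' cumulative losses
in a generalised sense (as in, e.g., Chapter 3 of [HLP52]).

Lemma 4. For any prior $P_0$ and any $t = 1, 2, \ldots$, the cumulative loss of the
BA can be expressed as

$$\mathrm{Loss}_t = -\ln \int_\Theta e^{-\mathrm{Loss}_t(\theta)} P_0(d\theta) .$$

Proof. The proof is by induction on $t$. For $t = 0$ the equality is obvious and for
$t > 0$ we have

$$\begin{aligned}
\mathrm{Loss}_t &= \mathrm{Loss}_{t-1} - \ln \xi_t(y_t) \\
&= -\ln \int_\Theta e^{-\mathrm{Loss}_{t-1}(\theta)} P_0(d\theta) - \ln \frac{\int_\Theta \xi_t^\theta(y_t) e^{-\mathrm{Loss}_{t-1}(\theta)} P_0(d\theta)}{\int_\Theta e^{-\mathrm{Loss}_{t-1}(\theta)} P_0(d\theta)} \\
&= -\ln \int_\Theta e^{-\left(-\ln \xi_t^\theta(y_t) + \mathrm{Loss}_{t-1}(\theta)\right)} P_0(d\theta) = -\ln \int_\Theta e^{-\mathrm{Loss}_t(\theta)} P_0(d\theta)
\end{aligned}$$

(the second equality follows from the inductive assumption, the definition of $\xi_t$,
and (7)). □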
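Lemma 4 is easy to observe on a finite pool, where the integrals turn into sums. Below is a minimal sketch under that assumption; the function names (`gaussian_density`, `bayesian_algorithm`, `mixture_loss`) and the choice of experts predicting fixed Gaussian densities are ours and serve only as an illustration of Protocol 3.

```python
import math


def gaussian_density(mean, variance):
    """Density of N(mean, variance) as a function of the outcome y."""
    def density(y):
        return math.exp(-(y - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)
    return density


def bayesian_algorithm(prior, predictions, outcomes):
    """Run the BA over a finite pool of experts.

    prior[i] is the initial weight of expert i (the weights sum to 1);
    predictions[t][i] is the density announced by expert i on step t.
    Returns the learner's cumulative loss and the list of experts' cumulative losses.
    """
    weights = list(prior)  # normalised weights P*_{t-1}
    learner_loss = 0.0
    expert_losses = [0.0] * len(prior)
    for densities, y in zip(predictions, outcomes):
        values = [xi(y) for xi in densities]                   # xi_t^theta(y_t)
        mixture = sum(w * v for w, v in zip(weights, values))  # xi_t(y_t)
        learner_loss -= math.log(mixture)
        expert_losses = [loss - math.log(v) for loss, v in zip(expert_losses, values)]
        weights = [w * v / mixture for w, v in zip(weights, values)]  # update and normalise
    return learner_loss, expert_losses


def mixture_loss(prior, expert_losses):
    """Right-hand side of Lemma 4 for a finite pool."""
    return -math.log(sum(p * math.exp(-loss) for p, loss in zip(prior, expert_losses)))
```

Up to rounding, `bayesian_algorithm` returns a learner's loss equal to `mixture_loss(prior, expert_losses)`. In particular the learner's loss exceeds the loss of expert $i$ by at most $-\ln$ of that expert's prior weight.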

6.3 Linear Ridge Regression as a Mixture 

The above protocols can incorporate signals as in Protocol 1. Indeed, let the
reality announce a signal $x_t$ on each step $t$; the signal can be used by both the
experts and the learner.

Suppose that signals come from $\mathbb{R}^n$. Take a pool of Gaussian experts $\Theta = \mathbb{R}^n$.
Fix some $\sigma > 0$ and let expert $\theta$ output the density of the Gaussian distribution
$N(\theta' x_t, \sigma^2)$, i.e.,

$$\xi_t^\theta(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(\theta' x_t - y)^2}{2\sigma^2}} , \qquad (8)$$

on step $t$.

Let us assume the multivariate Gaussian distribution $N(0, I)$ with the density

$$p_0(\theta) = \frac{1}{(2\pi)^{n/2}} \, e^{-\|\theta\|^2 / 2} \qquad (9)$$

as the initial distribution over the pool of experts. We will show that the learner
using the Bayesian merging algorithm with this initial distribution will be
outputting a Gaussian density with the mean of the ridge regression prediction.
Note that there is no assumption on the mechanism generating outcomes $y_t$.

Let $Y_t$ be the vector of outcomes $y_1, y_2, \ldots, y_t$. Let $X_t$ be the design matrix
made up of column vectors $x_1, x_2, \ldots, x_t$ and $A_t = X_t X_t' + \sigma^2 I$, $t = 1, 2, \ldots$

Lemma 5. The learner using the Bayesian merging algorithm with the initial
distribution (9) on the pool of experts $\mathbb{R}^n$ predicting according to (8) will be
outputting on step $T = 1, 2, \ldots$ the density

$$\xi_T(y) = \frac{1}{\sqrt{2\pi\sigma_T^2}} \, e^{-\frac{(\gamma_T^{\mathrm{RR}} - y)^2}{2\sigma_T^2}} ,$$

where

$$\gamma_T^{\mathrm{RR}} = Y_{T-1}' X_{T-1}' A_{T-1}^{-1} x_T , \qquad \sigma_T^2 = \sigma^2 x_T' A_{T-1}^{-1} x_T + \sigma^2 .$$

We have $\gamma_T^{\mathrm{RR}} = (\theta_T^{\mathrm{RR}})' x_T$, where $\theta_T^{\mathrm{RR}} = A_{T-1}^{-1} X_{T-1} Y_{T-1}$. At $\theta_T^{\mathrm{RR}}$ the
minimum $\min_{\theta \in \mathbb{R}^n} \left( \sum_{t=1}^{T-1} (\theta' x_t - y_t)^2 + \sigma^2 \|\theta\|^2 \right)$ is achieved. This can be checked
directly by differentiation or by reducing to Proposition 2 (see Subsection 6.5
for a discussion of linear ridge regression as a special case of kernel ridge
regression). We will refer to the function $(\theta_T^{\mathrm{RR}})' x$ as the linear ridge regression
with the parameter $\sigma^2$. We are considering the on-line mode, but linear ridge
regression can also be applied in the batch mode just like the general kernel
ridge regression.

Let us prove the lemma. 

Proof. To evaluate the integral

$$\xi_T(y) = \int_{\mathbb{R}^n} \xi_T^\theta(y) P_{T-1}^*(d\theta) \qquad (10)$$

we will use a probabilistic interpretation.

Let $\theta$ be a random value distributed according to $P_{T-1}^*$, i.e., having the
density

$$p_\theta(u) \propto e^{-\frac{1}{2\sigma^2} \sum_{t=1}^{T-1} (u' x_t - y_t)^2 - \frac{1}{2} \|u\|^2} .$$

Clearly, $\theta$ has a multivariate Gaussian distribution. The mean of a Gaussian
distribution coincides with its mode and thus the mean of $\theta$ equals

$$\theta_T^{\mathrm{RR}} = \arg\min_{u \in \mathbb{R}^n} \left( \sum_{t=1}^{T-1} (u' x_t - y_t)^2 + \sigma^2 \|u\|^2 \right) = A_{T-1}^{-1} X_{T-1} Y_{T-1} .$$

The covariance matrix

$$\left( \frac{1}{\sigma^2} \sum_{t=1}^{T-1} x_t x_t' + I \right)^{-1} = \sigma^2 A_{T-1}^{-1}$$

can be obtained by singling out the quadratic part of the quadratic form in $u$.

Let $y$ be a random value given by $y = \theta' x_T + \varepsilon$, where $\varepsilon$ is independent of $\theta$
and has a Gaussian distribution with the mean of $0$ and variance of $\sigma^2$. Clearly,
given that $\theta = u$, the distribution of $y$ is $N(u' x_T, \sigma^2)$. The marginal density of
$y$ is just the $\xi_T(y)$ we need to evaluate.

We will use the following statement from [Bis06], Section 2.3.3. Let $\eta$ have
the (multivariate) Gaussian distribution $N(\mu, \Lambda^{-1})$ and let $\zeta$, given $\eta$, have the
(multivariate) Gaussian distribution $N(A\eta + b, L^{-1})$, where $A$ is a fixed matrix and $b$ is a
fixed vector. Then the marginal distribution of $\zeta$ is $N(A\mu + b, L^{-1} + A \Lambda^{-1} A')$.

We get that the mean of $y$ is $x_T' \theta_T^{\mathrm{RR}}$ and the variance is $\sigma^2 + x_T' \sigma^2 A_{T-1}^{-1} x_T$.

□

The lemma is essentially equivalent to the following statement from Bayesian
statistics. Let $y_t = x_t' \theta + \varepsilon_t$, where $\varepsilon_t$ are independent Gaussian values with
the means of $0$ and variances $\sigma^2$, and $x_t$ are not stochastic. Let the prior
distribution for $\theta$ be $N(0, I)$. Then the distribution for $y_T$ given the observations
$x_1, y_1, x_2, y_2, \ldots, x_{T-1}, y_{T-1}, x_T$ is $N(\gamma_T^{\mathrm{RR}}, \sigma_T^2)$; see, e.g., [Bis06], Section 3.3.2
or [HK00].



6.4 The Identity in the Linear Case 

The following theorem is a special case of Theorem 1.

Theorem 2. Take $a > 0$. For a sample $(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)$, where
$x_1, x_2, \ldots, x_T \in \mathbb{R}^n$ and $y_1, y_2, \ldots, y_T \in \mathbb{R}$, let $\gamma_1^{\mathrm{RR}}, \gamma_2^{\mathrm{RR}}, \ldots, \gamma_T^{\mathrm{RR}}$ be the
predictions output by linear ridge regression with the parameter $a$ in the on-line
mode. Then

$$\sum_{t=1}^T \frac{(\gamma_t^{\mathrm{RR}} - y_t)^2}{1 + x_t' A_{t-1}^{-1} x_t} = \min_{\theta \in \mathbb{R}^n} \left( \sum_{t=1}^T (\theta' x_t - y_t)^2 + a\|\theta\|^2 \right) = a Y_T' (X_T' X_T + aI)^{-1} Y_T ,$$

where $A_t = \sum_{i=1}^t x_i x_i' + aI = X_t X_t' + aI$, $X_t$ is the design matrix consisting of
column vectors $x_1, x_2, \ldots, x_t$, and $Y_t = (y_1, y_2, \ldots, y_t)'$.
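As a sanity check of the statement (a worked example of our own, not part of the original argument), the case $T = 1$ can be verified by hand. Write $s = \|x_1\|^2$; on the first step the on-line prediction is $\gamma_1^{\mathrm{RR}} = 0$ and $A_0 = aI$, so all three terms reduce to the same value:

```latex
\begin{align*}
\frac{(\gamma_1^{\mathrm{RR}} - y_1)^2}{1 + x_1' A_0^{-1} x_1}
  &= \frac{y_1^2}{1 + s/a} = \frac{a y_1^2}{a + s}, \\
\min_{\theta \in \mathbb{R}^n} \left( (\theta' x_1 - y_1)^2 + a\|\theta\|^2 \right)
  &= \frac{a^2 y_1^2}{(a + s)^2} + \frac{a s y_1^2}{(a + s)^2} = \frac{a y_1^2}{a + s}
  \quad \text{at } \theta = \frac{y_1}{a + s}\, x_1, \\
a Y_1' (X_1' X_1 + aI)^{-1} Y_1 &= \frac{a y_1^2}{a + s}.
\end{align*}
```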
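Proof. We start by showing that the first two terms are equal and then proceed
to the last term.

Consider the pool of Gaussian experts with the variance $\sigma^2 = a$ and the
learner following the Bayesian merging algorithm with the initial distribution
$N(0, I)$ on the experts.

It follows from Lemma 5 that the total loss of the learner over $T$ steps is
given by

$$\mathrm{Loss}_T = -\sum_{t=1}^T \ln \left( \frac{1}{\sqrt{2\pi\sigma_t^2}} \, e^{-\frac{(\gamma_t^{\mathrm{RR}} - y_t)^2}{2\sigma_t^2}} \right) \qquad (11)$$

$$= \sum_{t=1}^T \frac{(\gamma_t^{\mathrm{RR}} - y_t)^2}{2\sigma_t^2} + \ln \prod_{t=1}^T \sigma_t + \frac{T}{2} \ln(2\pi) , \qquad (12)$$

where $\sigma_t^2 = \sigma^2 (1 + x_t' A_{t-1}^{-1} x_t)$.
Lemma 4 implies that

$$\mathrm{Loss}_T = -\ln \left( \frac{1}{(2\pi\sigma^2)^{T/2} (2\pi)^{n/2}} \int_{\mathbb{R}^n} e^{-\frac{1}{2\sigma^2} \left( \sum_{t=1}^T (\theta' x_t - y_t)^2 + \sigma^2 \|\theta\|^2 \right)} \, d\theta \right) .$$

It follows from Proposition 1 that the integral evaluates to

$$e^{-\frac{1}{2\sigma^2} \min_{\theta \in \mathbb{R}^n} \left( \sum_{t=1}^T (\theta' x_t - y_t)^2 + \sigma^2 \|\theta\|^2 \right)} \frac{\pi^{n/2}}{\sqrt{\det(A_T / (2\sigma^2))}} = e^{-\frac{1}{2\sigma^2} \min_{\theta \in \mathbb{R}^n} \left( \sum_{t=1}^T (\theta' x_t - y_t)^2 + \sigma^2 \|\theta\|^2 \right)} \frac{(2\pi)^{n/2}}{\sqrt{\det(A_T / \sigma^2)}}$$

and thus

$$\mathrm{Loss}_T = \frac{1}{2\sigma^2} \min_{\theta \in \mathbb{R}^n} \left( \sum_{t=1}^T (\theta' x_t - y_t)^2 + \sigma^2 \|\theta\|^2 \right) + \frac{T}{2} \ln(2\pi) + T \ln \sigma + \frac{1}{2} \ln \det \frac{A_T}{\sigma^2} . \qquad (13)$$

Let us equate the expressions for the loss provided by (12) and (13). To
prove the identity we need to show that

$$\sum_{t=1}^T \ln \sigma_t = T \ln \sigma + \frac{1}{2} \ln \det \frac{A_T}{\sigma^2} .$$

This equality follows from the lemma.

Lemma 6. For any $a > 0$ and positive integer $T$ we have

$$\det \frac{A_T}{a} = \prod_{t=1}^T (1 + x_t' A_{t-1}^{-1} x_t) ,$$

where $A_t = \sum_{i=1}^t x_i x_i' + aI$.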
Proof. We will use the matrix determinant lemma

$$\det(A + uv') = (1 + v' A^{-1} u) \det A ,$$

which holds for any non-singular square matrix $A$ and vectors $u$ and $v$ (see, e.g.,
[Har97], Theorem 18.1.1). We get

$$\begin{aligned}
\det \frac{A_T}{a} &= \frac{1}{a^n} \det(A_{T-1} + x_T x_T') \\
&= \frac{1}{a^n} \det(A_{T-1}) (1 + x_T' A_{T-1}^{-1} x_T) \\
&= \ldots \\
&= \frac{1}{a^n} \det(aI) \prod_{t=1}^T (1 + x_t' A_{t-1}^{-1} x_t) \\
&= \prod_{t=1}^T (1 + x_t' A_{t-1}^{-1} x_t) .
\end{aligned}$$

□
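The telescoping argument can be replayed numerically. The sketch below, in exact rational arithmetic, accumulates $A_t = A_{t-1} + x_t x_t'$ one signal at a time and compares $\det(A_T / a)$ with the running product; the helpers `solve`, `det` and `lemma6_sides` are illustrative names of our own.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A z = b exactly by Gauss-Jordan elimination (A square, non-singular)."""
    n = len(A)
    M = [[Fraction(v) for v in A[i]] + [Fraction(b[i])] for i in range(n)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)  # row with a non-zero pivot
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]


def det(A):
    """Exact determinant by Gaussian elimination with row swaps."""
    n = len(A)
    M = [[Fraction(v) for v in row] for row in A]
    d = Fraction(1)
    for c in range(n):
        p = next((r for r in range(c, n) if M[r][c] != 0), None)
        if p is None:
            return Fraction(0)
        if p != c:
            M[c], M[p] = M[p], M[c]
            d = -d  # a row swap changes the sign
        d *= M[c][c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return d


def lemma6_sides(xs, a):
    """Return (det(A_T / a), product of 1 + x_t' A_{t-1}^{-1} x_t)."""
    a = Fraction(a)
    n = len(xs[0])
    A = [[a if i == j else Fraction(0) for j in range(n)] for i in range(n)]  # A_0 = aI
    product = Fraction(1)
    for x in xs:
        w = solve(A, x)  # A_{t-1}^{-1} x_t
        product *= 1 + sum(xi * wi for xi, wi in zip(x, w))
        for i in range(n):
            for j in range(n):
                A[i][j] += x[i] * x[j]  # A_t = A_{t-1} + x_t x_t'
    return det(A) / a ** n, product
```

For instance, `lemma6_sides([[1, 2], [0, 1], [3, -1]], Fraction(1, 2))` returns a pair of equal fractions.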



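Remark 3. The lemma is in fact a special case of Lemma 3 with the linear kernel
$\mathcal{K}(u_1, u_2) = u_1' u_2$ and $\sigma^2 = a$. As shown in Subsection 6.5, $d_t / a = x_t' A_{t-1}^{-1} x_t$.
The Sylvester identity implies that
$\det(I + K_T / a) = \det(I + X_T' X_T / a) = \det(I + X_T X_T' / a) = \det(A_T / a)$.

We have shown that the left-hand side and the middle terms in the identity
are equal. Let us proceed to the equality between the middle and the right-hand
side terms.

The minimum in the middle term is achieved on $\theta_{T+1}^{\mathrm{RR}} = A_T^{-1} X_T Y_T =
(X_T X_T' + aI)^{-1} X_T Y_T$ as shown in Subsection 6.3. Using Lemma 7 we can
also write $\theta_{T+1}^{\mathrm{RR}} = X_T (X_T' X_T + aI)^{-1} Y_T$. The proof is by direct substitution of
these expressions for $\theta_{T+1}^{\mathrm{RR}}$. We have

$$\begin{aligned}
M &= \min_{\theta \in \mathbb{R}^n} \left( \sum_{t=1}^T (\theta' x_t - y_t)^2 + a\|\theta\|^2 \right) = \sum_{t=1}^T \left( (\theta_{T+1}^{\mathrm{RR}})' x_t - y_t \right)^2 + a\|\theta_{T+1}^{\mathrm{RR}}\|^2 \\
&= (\theta_{T+1}^{\mathrm{RR}})' (X_T X_T' + aI) \theta_{T+1}^{\mathrm{RR}} - 2 (\theta_{T+1}^{\mathrm{RR}})' X_T Y_T + Y_T' Y_T .
\end{aligned}$$

Substituting the first expression for the second appearance of $\theta_{T+1}^{\mathrm{RR}}$ and
cancelling out $X_T X_T' + aI$ we get

$$M = \left( -(\theta_{T+1}^{\mathrm{RR}})' X_T + Y_T' \right) Y_T .$$

Substituting the second expression for $\theta_{T+1}^{\mathrm{RR}}$ yields

$$M = Y_T' \left( -(X_T' X_T + aI)^{-1} X_T' X_T + I \right) Y_T .$$

It remains to carry $(X_T' X_T + aI)^{-1}$ out of the brackets and cancel out the
remaining terms. □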
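All three terms of Theorem 2 can be compared on a small sample. The sketch below is only an illustration in exact rational arithmetic: the helper names (`solve`, `dot`, `linear_identity_terms`) are our own, and the on-line predictions are computed naively by re-solving the linear system on every step.

```python
from fractions import Fraction


def solve(A, b):
    """Solve A z = b exactly by Gauss-Jordan elimination (A square, non-singular)."""
    n = len(A)
    M = [[Fraction(v) for v in A[i]] + [Fraction(b[i])] for i in range(n)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)  # row with a non-zero pivot
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]


def dot(u, v):
    """Scalar product of two vectors given as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def linear_identity_terms(xs, ys, a):
    """Return the left-hand, middle and right-hand terms of the identity."""
    a = Fraction(a)
    n, T = len(xs[0]), len(xs)
    A = [[a if i == j else Fraction(0) for j in range(n)] for i in range(n)]  # A_0 = aI
    b = [Fraction(0)] * n  # X_{t-1} Y_{t-1}
    lhs = Fraction(0)
    for x, y in zip(xs, ys):
        gamma = dot(solve(A, b), x)  # on-line ridge regression prediction
        q = dot(x, solve(A, x))      # x_t' A_{t-1}^{-1} x_t
        lhs += (gamma - y) ** 2 / (1 + q)
        for i in range(n):
            b[i] += y * x[i]
            for j in range(n):
                A[i][j] += x[i] * x[j]
    theta = solve(A, b)  # minimiser A_T^{-1} X_T Y_T
    middle = sum((dot(theta, x) - y) ** 2 for x, y in zip(xs, ys)) + a * dot(theta, theta)
    G = [[dot(xs[i], xs[j]) + (a if i == j else 0) for j in range(T)] for i in range(T)]
    rhs = a * dot(ys, solve(G, ys))  # a Y'(X'X + aI)^{-1} Y
    return lhs, middle, rhs
```

On the sample `xs = [[1, 0], [1, 1], [2, -1], [0, 3]]`, `ys = [1, 0, -2, 1]` with `a = Fraction(1, 3)` the three returned fractions coincide.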
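6.5 Kernelisation

Let us derive Theorem 1 from Theorem 2.

First, let us show that Theorem 2 is really a special case of Theorem 1 for
the linear kernel $\mathcal{K}(x_1, x_2) = x_1' x_2$. We will consider the identity term by term.
By Lemma 7 the prediction output by linear ridge regression on step $t$ equals

$$Y_{t-1}' X_{t-1}' (X_{t-1} X_{t-1}' + aI)^{-1} x_t = Y_{t-1}' (X_{t-1}' X_{t-1} + aI)^{-1} X_{t-1}' x_t = Y_{t-1}' (K_{t-1} + aI)^{-1} k_{t-1}(x_t) .$$

For the linear kernel the expression $d_t / a$ in the denominator of the identity can
be rewritten as follows:

$$\begin{aligned}
\frac{d_t}{a} &= \frac{1}{a} \left[ \mathcal{K}(x_t, x_t) - k_{t-1}'(x_t)(K_{t-1} + aI)^{-1} k_{t-1}(x_t) \right] \\
&= \frac{1}{a} \left[ x_t' x_t - (x_t' X_{t-1})(X_{t-1}' X_{t-1} + aI)^{-1} (X_{t-1}' x_t) \right] .
\end{aligned}$$

We can apply Lemma 7 and further obtain

$$\begin{aligned}
\frac{d_t}{a} &= \frac{1}{a} \left[ x_t' x_t - x_t' (X_{t-1} X_{t-1}' + aI)^{-1} X_{t-1} X_{t-1}' x_t \right] \\
&= x_t' (X_{t-1} X_{t-1}' + aI)^{-1} x_t \qquad (14) \\
&= x_t' A_{t-1}^{-1} x_t .
\end{aligned}$$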

Let us proceed to the middle term in the identity. The set of functions
$f_\theta(x) = \theta' x$ on $\mathbb{R}^n$ with the scalar product $\langle f_{\theta_1}, f_{\theta_2} \rangle = \theta_1' \theta_2$ is a Hilbert space.
It contains all functions $\mathcal{K}(u, \cdot) = f_u$ and the reproducing property for $\mathcal{K}$ holds:
$\langle f_\theta, \mathcal{K}(x, \cdot) \rangle = \langle f_\theta, f_x \rangle = \theta' x = f_\theta(x)$. The minimum in the middle term of
Theorem 2 is thus the same as in the middle term of Theorem 1.

For the right-hand side term the equality is obvious.

Now take an arbitrary kernel $\mathcal{K}$ on a domain $X$ and let $\mathcal{F}$ be the
corresponding RKHS. We will apply a standard kernel trick. Consider a sample
$(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)$, where $x_t \in X$ and $y_t \in \mathbb{R}$, $t = 1, 2, \ldots, T$. It
follows from the representer theorem (see Proposition 3) that the minimum
in the middle term is achieved on a linear combination of the form
$f(\cdot) = \sum_{t=1}^T c_t \mathcal{K}(x_t, \cdot)$, where $c_1, c_2, \ldots, c_T \in \mathbb{R}$. These linear combinations form a
finite-dimensional subspace in the RKHS $\mathcal{F}$. Let $e_1, e_2, \ldots, e_m$, $m \le T$, be its
orthonormal base and let $C$ map each linear combination $f$ into the (column)
vector of its coordinates in $e_1, e_2, \ldots, e_m$. Since the base is orthonormal, the
scalar product does not change and $\langle f_1, f_2 \rangle_{\mathcal{F}} = (C(f_1))' C(f_2)$. The reproducing
property implies that

$$f(x_t) = \langle f, \mathcal{K}(x_t, \cdot) \rangle_{\mathcal{F}} = (C(f))' C(\mathcal{K}(x_t, \cdot))$$

for $t = 1, 2, \ldots, T$. We also have

$$\mathcal{K}(x_i, x_j) = \langle \mathcal{K}(x_i, \cdot), \mathcal{K}(x_j, \cdot) \rangle_{\mathcal{F}} = (C(\mathcal{K}(x_i, \cdot)))' C(\mathcal{K}(x_j, \cdot)) ,$$

$i, j = 1, 2, \ldots, T$. Note that $C$ is a surjection: each $\theta \in \mathbb{R}^m$ is an image of some
linear combination $f$.

Consider the sample $(\tilde{x}_1, y_1), (\tilde{x}_2, y_2), \ldots, (\tilde{x}_T, y_T)$, where $\tilde{x}_t = C(\mathcal{K}(x_t, \cdot)) \in \mathbb{R}^m$,
$t = 1, 2, \ldots, T$. Clearly, linear ridge regression in the on-line mode outputs
the same predictions on this sample as the kernel ridge regression on the original
sample and $\langle \tilde{x}_i, \tilde{x}_j \rangle = \mathcal{K}(x_i, x_j)$. The minimum from Theorem 1 on the original
sample clearly coincides with the minimum from Theorem 2 on the new sample.

Theorem 1 follows.

Appendix A. Optimality of Kernel Ridge Regression

In this appendix we derive the optimality property for the ridge regression
function $f_{\mathrm{RR}}$.

Proposition 2. Let $\mathcal{K} : X \times X \to \mathbb{R}$ be a kernel on a domain $X$ and $\mathcal{F}$ be the
corresponding RKHS. For every non-negative integer $T$, every $x_1, x_2, \ldots, x_T \in X$
and $y_1, y_2, \ldots, y_T \in \mathbb{R}$, and every $a > 0$ the minimum

$$\min_{f \in \mathcal{F}} \left( \sum_{t=1}^T (f(x_t) - y_t)^2 + a\|f\|_{\mathcal{F}}^2 \right) \qquad (15)$$

is achieved on the unique function $f_{\mathrm{RR}}(x) = Y'(aI + K)^{-1} k(x)$ for $T > 0$, where
$Y$, $K$, and $k(x)$ are as defined in Section 2, and $f_{\mathrm{RR}}(x) = 0$ identically for $T = 0$.

Proof. If $T = 0$, i.e., the initial sample is empty, the sum in (15) contains no
terms and the minimum is achieved on the unique function $f$ with the norm
$\|f\|_{\mathcal{F}} = 0$. This function is identically equal to zero and it coincides with $f_{\mathrm{RR}}$
for this case by definition. For the rest of the proof assume $T > 0$.

The representer theorem (see Proposition 3) implies that every minimum
in (15) is achieved on a linear combination of the form $f(\cdot) = \sum_{t=1}^T c_t \mathcal{K}(x_t, \cdot)$.

The minimum in (15) thus can be taken over a finite-dimensional space. As
$\|f\|_{\mathcal{F}} \to \infty$, the expression tends to $+\infty$, and thus the minimum can be taken
over a closed bounded set of functions. The value $f(x) = \langle f, \mathcal{K}(x, \cdot) \rangle_{\mathcal{F}}$ is continuous in
$f$ for every $x \in X$. Therefore we are minimising a continuous function over a
closed bounded set in a finite-dimensional space. The minimum must be achieved on
some $f$.

Let $C = (c_1, c_2, \ldots, c_T)'$ be the vector of coefficients of some optimal
function $f(x) = \sum_{t=1}^T c_t \mathcal{K}(x_t, x) = C' k(x)$. It is easy to see that the vector
$(f(x_1), f(x_2), \ldots, f(x_T))'$ of values of $f$ equals $KC$ and

$$\|f\|_{\mathcal{F}}^2 = \sum_{i,j=1}^T c_i c_j \langle \mathcal{K}(x_i, \cdot), \mathcal{K}(x_j, \cdot) \rangle_{\mathcal{F}} = C' K C .$$

Thus

$$\begin{aligned}
\sum_{t=1}^T (f(x_t) - y_t)^2 + a\|f\|_{\mathcal{F}}^2 &= \|KC - Y\|^2 + a C' K C \\
&= C' K^2 C - 2 Y' K C + \|Y\|^2 + a C' K C .
\end{aligned}$$

Since $f$ is optimal, the derivative over $C$ must vanish. By differentiation we
obtain

$$2 K^2 C - 2 K Y + 2 a K C = 0$$

and

$$K (K + aI) C = K Y .$$

Hence

$$(K + aI) C = Y + v$$

and

$$C = (K + aI)^{-1} Y + (K + aI)^{-1} v ,$$

where $v$ belongs to the null space of $K$, i.e., $Kv = 0$.

Let us show that $K (K + aI)^{-1} v = 0$. We need a simple matrix identity; as
it occurs in this paper quite often, we formulate it explicitly.

Lemma 7. For any (not necessarily square) matrices $A$ and $B$ and any constant
$a$ the identity

$$A (BA + aI)^{-1} = (AB + aI)^{-1} A$$

holds provided the inversions can be performed. If $B = A'$ and $a > 0$, the
matrices $AB + aI$ and $BA + aI$ are both positive-definite and therefore
non-singular.

Proof. We have $ABA + aA = A(BA + aI) = (AB + aI)A$. If $AB + aI$ and
$BA + aI$ are invertible, we can multiply the equality by the inverses. □
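Since the lemma is used repeatedly, a quick exact check on non-square matrices may be reassuring. The sketch below uses helper names of our own (`matmul`, `transpose`, `add_scaled_identity`, `inverse`, `lemma7_sides`) and assumes that both inverses exist for the inputs supplied.

```python
from fractions import Fraction


def matmul(P, Q):
    """Product of two matrices given as lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q))) for j in range(len(Q[0]))]
            for i in range(len(P))]


def transpose(P):
    return [list(col) for col in zip(*P)]


def add_scaled_identity(P, a):
    """Return P + aI for a square matrix P."""
    return [[P[i][j] + (a if i == j else 0) for j in range(len(P))] for i in range(len(P))]


def inverse(P):
    """Exact inverse by Gauss-Jordan elimination on the augmented matrix [P | I]."""
    n = len(P)
    M = [[Fraction(v) for v in P[i]] + [Fraction(int(i == j)) for j in range(n)]
         for i in range(n)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)  # row with a non-zero pivot
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]


def lemma7_sides(A, B, a):
    """Return (A (BA + aI)^{-1}, (AB + aI)^{-1} A)."""
    a = Fraction(a)
    left = matmul(A, inverse(add_scaled_identity(matmul(B, A), a)))
    right = matmul(inverse(add_scaled_identity(matmul(A, B), a)), A)
    return left, right
```

With a $2 \times 3$ matrix such as `A = [[1, 2, 0], [0, 1, -1]]`, the inverse on the left is of a $3 \times 3$ matrix and the one on the right is of a $2 \times 2$ matrix, yet both sides returned by `lemma7_sides(A, transpose(A), Fraction(1, 2))` agree entry by entry.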

We get $K (K + aI)^{-1} v = (K + aI)^{-1} K v = 0$. Therefore $C$ has the form
$C = (K + aI)^{-1} Y + u$, where $Ku = 0$.

Consider the function $f_u(x) = u' k(x)$. It is a linear combination of $\mathcal{K}(x_i, \cdot)$.
On the other hand, it vanishes at every $x_t$, $t = 1, 2, \ldots, T$, because $Ku = 0$. We
have

$$0 = f_u(x_t) = \langle f_u, \mathcal{K}(x_t, \cdot) \rangle_{\mathcal{F}}$$

and thus $f_u$ is orthogonal to the space of linear combinations. This is only
possible if $f_u = 0$.

Thus the minimum can only be achieved on a unique function that can
be represented as $f_{\mathrm{RR}}(x) = Y'(K + aI)^{-1} k(x)$. Since it must be achieved
somewhere, it is achieved on $f_{\mathrm{RR}}$. □



Appendix B. Representer Theorem 

In this appendix we formulate and prove a version of the representer
theorem for RKHSs. See [SHS01] for more details including a history of the theorem.

Proposition 3. Let $\mathcal{K}$ be a kernel on a domain $X$, $\mathcal{F}$ be the corresponding
RKHS and $(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)$ be a sample such that $x_t \in X$ and
$y_t \in \mathbb{R}$, $t = 1, 2, \ldots, T$. Then for every $f \in \mathcal{F}$ there is a linear combination
$\tilde{f}(\cdot) = \sum_{t=1}^T c_t \mathcal{K}(x_t, \cdot) \in \mathcal{F}$ such that

$$\sum_{t=1}^T (\tilde{f}(x_t) - y_t)^2 \le \sum_{t=1}^T (f(x_t) - y_t)^2$$

and $\|\tilde{f}\|_{\mathcal{F}} \le \|f\|_{\mathcal{F}}$. If $f$ is not itself a linear combination of this type, there is a
linear combination $\tilde{f}$ with this property such that $\|\tilde{f}\|_{\mathcal{F}} < \|f\|_{\mathcal{F}}$.

Proof. The linear combinations of $\mathcal{K}(x_i, \cdot)$ form a finite-dimensional (and
therefore closed) subspace in the Hilbert space $\mathcal{F}$. Every $f \in \mathcal{F}$ can be represented as
$f = h + g$, where $h$ is a linear combination and $g$ is orthogonal to the subspace of
linear combinations. For every $t = 1, 2, \ldots, T$ we have $g(x_t) = \langle g, \mathcal{K}(x_t, \cdot) \rangle_{\mathcal{F}} = 0$
and the values of $f$ and $h$ on $x_1, x_2, \ldots, x_T$ coincide. On the other hand, the
Pythagoras theorem implies that $\|f\|_{\mathcal{F}}^2 = \|h\|_{\mathcal{F}}^2 + \|g\|_{\mathcal{F}}^2 \ge \|h\|_{\mathcal{F}}^2$; if $g \ne 0$, the
inequality is strict. □



Appendix C. An Upper Bound on a Determinant 

In this appendix we reproduce an upper bound from [CBCG05].

Proposition 4. Let the columns of an $n \times T$ matrix $X$ be vectors $x_1, x_2, \ldots, x_T \in \mathbb{R}^n$
and $a > 0$. If $\|x_t\| \le B$, $t = 1, 2, \ldots, T$, then

$$\det\left(I + \frac{1}{a} X X'\right) = \det\left(I + \frac{1}{a} X' X\right) \le \left(1 + \frac{T B^2}{a n}\right)^n .$$

Proof. Let $\lambda_1, \lambda_2, \ldots, \lambda_n \ge 0$ be the eigenvalues (counting multiplicities) of the
symmetric positive-semidefinite matrix $X X'$. The eigenvalues of $I + \frac{1}{a} X X'$ are then
$1 + \lambda_1/a, 1 + \lambda_2/a, \ldots, 1 + \lambda_n/a$ and $\det(I + \frac{1}{a} X X') = \prod_{i=1}^n (1 + \lambda_i/a)$.

The sum of eigenvalues $\lambda_1 + \lambda_2 + \ldots + \lambda_n$ equals the trace $\mathrm{tr}(X X')$ and
$\mathrm{tr}(X X') = \mathrm{tr}(X' X)$. Indeed, the matrices $AB$ and $BA$ (provided they exist)
have the same non-zero eigenvalues counting multiplicities while zero
eigenvalues do not contribute to the trace. Alternatively one can verify the equality
$\mathrm{tr}(AB) = \mathrm{tr}(BA)$ by a straightforward calculation, see, e.g., [Axl97],
Proposition 10.9 (p. 219). The matrix $X' X$ is the Gram matrix of vectors $x_1, x_2, \ldots, x_T$
and the elements on its diagonal are the squared quadratic norms of the vectors
not exceeding $B^2$. We get $\mathrm{tr}(X X') = \mathrm{tr}(X' X) \le T B^2$.

The problem has reduced to obtaining an upper bound on the product of
some positive numbers with a known sum. The inequality of arithmetic and
geometric means implies that

$$\prod_{i=1}^n \left(1 + \frac{\lambda_i}{a}\right) \le \left( \frac{1}{n} \sum_{i=1}^n \left(1 + \frac{\lambda_i}{a}\right) \right)^n = \left(1 + \frac{1}{an} \sum_{i=1}^n \lambda_i\right)^n .$$

Combining this with the bound on the trace obtained earlier proves the proposition.

□
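The bound is straightforward to test on small data. The sketch below, again in exact rational arithmetic and with helper names of our own (`det`, `bound_sides`), computes both determinants and the upper bound, taking $B^2$ to be the largest squared norm among the signals (an assumption that makes the bound as tight as the proposition allows).

```python
from fractions import Fraction


def det(A):
    """Exact determinant by Gaussian elimination with row swaps."""
    n = len(A)
    M = [[Fraction(v) for v in row] for row in A]
    d = Fraction(1)
    for c in range(n):
        p = next((r for r in range(c, n) if M[r][c] != 0), None)
        if p is None:
            return Fraction(0)
        if p != c:
            M[c], M[p] = M[p], M[c]
            d = -d  # a row swap changes the sign
        d *= M[c][c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return d


def bound_sides(xs, a):
    """xs lists the T columns of X; return (det(I + XX'/a), det(I + X'X/a), bound)."""
    a = Fraction(a)
    T, n = len(xs), len(xs[0])
    outer = [[sum(x[i] * x[j] for x in xs) for j in range(n)] for i in range(n)]  # X X'
    gram = [[sum(u * v for u, v in zip(xs[s], xs[t])) for t in range(T)]
            for s in range(T)]                                                    # X' X
    d_outer = det([[(1 if i == j else 0) + outer[i][j] / a for j in range(n)]
                   for i in range(n)])
    d_gram = det([[(1 if s == t else 0) + gram[s][t] / a for t in range(T)]
                  for s in range(T)])
    b_squared = max(gram[t][t] for t in range(T))  # largest squared norm
    return d_outer, d_gram, (1 + T * b_squared / (a * n)) ** n
```

For the signals $(1, 0)'$, $(1, 1)'$, $(0, 2)'$ and $a = 2$ both determinants equal $27/4$, while the bound evaluates to $16$.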



Appendix D. A Lemma about Partitioned Matrices

In this appendix we formulate and prove a matrix lemma for the proof of
Lemma 1.

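Lemma 8. If a symmetric positive-definite matrix $M$ is partitioned as

$$M = \begin{pmatrix} A & B \\ B' & D \end{pmatrix} ,$$

where $A$ and $D$ are square matrices, and a column vector $x$ is partitioned as

$$x = \begin{pmatrix} u \\ v \end{pmatrix} ,$$

where $u$ is of the same height as $A$, then $x' M^{-1} x \ge u' A^{-1} u \ge 0$.

Proof. We shall rely on the following formula for inverting a partitioned matrix:
if

$$M = \begin{pmatrix} P & Q \\ R & S \end{pmatrix}$$

then the inverse can be written as

$$M^{-1} = \begin{pmatrix} \tilde{P} & \tilde{Q} \\ \tilde{R} & \tilde{S} \end{pmatrix} ,$$

where

$$\begin{aligned}
\tilde{P} &= P^{-1} + P^{-1} Q (S - R P^{-1} Q)^{-1} R P^{-1} , \\
\tilde{Q} &= -P^{-1} Q (S - R P^{-1} Q)^{-1} , \\
\tilde{R} &= -(S - R P^{-1} Q)^{-1} R P^{-1} , \\
\tilde{S} &= (S - R P^{-1} Q)^{-1} ,
\end{aligned}$$

provided all the inverses exist (see [PTVF07], Section 2.7.4, equation (2.7.25)).
Applying these formulae to our partitioning of $M$ we get

$$M^{-1} = \begin{pmatrix} A^{-1} + A^{-1} B E^{-1} B' A^{-1} & -A^{-1} B E^{-1} \\ -E^{-1} B' A^{-1} & E^{-1} \end{pmatrix} ,$$

where $E = D - B' A^{-1} B$.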

The matrix $A$ is symmetric positive-definite as a principal submatrix of a symmetric
positive-definite matrix; therefore it is non-singular. Non-singularity of $E$
follows from the identity

$$\det M = \det P \det(S - R P^{-1} Q) ,$$

where $M$ and its blocks are as above (see [PTVF07], Section 2.7.4,
equation (2.7.26) and [HJ85], Section 0.8.5; the matrix $S - R P^{-1} Q$ is known as
the Schur complement of $P$). Applying this identity to our matrices yields

$$\det M = \det A \det E$$

and since both $M$ and $A$ are non-singular, $E$ is also non-singular. This justifies
the use of the formula for the inverse of a partitioned matrix in this case.

Note also that $E^{-1}$ is symmetric and positive-definite as a principal submatrix of the
symmetric positive-definite matrix $M^{-1}$.

We can now write

$$x' M^{-1} x = u' A^{-1} u + u' A^{-1} B E^{-1} B' A^{-1} u - 2 u' A^{-1} B E^{-1} v + v' E^{-1} v$$

(since $u' A^{-1} B E^{-1} v$ is a number, it equals its transpose). The first term in the
sum is just what we need for the statement of the lemma. Let us show that the
sum of the remaining three terms is non-negative. Let $w = B' A^{-1} u$. We have

$$u' A^{-1} B E^{-1} B' A^{-1} u - 2 u' A^{-1} B E^{-1} v + v' E^{-1} v = w' E^{-1} w - 2 w' E^{-1} v + v' E^{-1} v = \begin{pmatrix} w' & v' \end{pmatrix} \begin{pmatrix} E^{-1} & -E^{-1} \\ -E^{-1} & E^{-1} \end{pmatrix} \begin{pmatrix} w \\ v \end{pmatrix} .$$

To complete the proof, we need the following simple lemma. 

Lemma 9. If a matrix $H$ is symmetric positive-semidefinite, then the matrix

$$\begin{pmatrix} H & -H \\ -H & H \end{pmatrix}$$

is also symmetric positive-semidefinite.

Proof. We will rely on the following criterion. A symmetric matrix $H$ is
positive-semidefinite if and only if it has a symmetric square root $L$ such that $H = L^2$
(the if part is trivial and the only if part can be proven by considering the
orthonormal base where $H$ diagonalises). We have

$$\begin{pmatrix} H & -H \\ -H & H \end{pmatrix} = \left( \frac{1}{\sqrt{2}} \begin{pmatrix} L & -L \\ -L & L \end{pmatrix} \right)^2 .$$

□

Thus

$$\begin{pmatrix} w' & v' \end{pmatrix} \begin{pmatrix} E^{-1} & -E^{-1} \\ -E^{-1} & E^{-1} \end{pmatrix} \begin{pmatrix} w \\ v \end{pmatrix} \ge 0 .$$

□